Large Language Models are Extremely Bad at Creating Anagrams
Michael King
Vanderbilt University

Corresponding Author: [email protected]


Much has been made of the remarkable abilities of generative artificial intelligence (AI) and large language models (LLMs) such as ChatGPT, particularly their capacity to rapidly produce convincingly human-like text on demand. I set out to evaluate the ability of three popular LLMs (ChatGPT version GPT-4; ChatGPT version GPT-3.5; Google Bard) to construct anagrams, that is, pairs of words or phrases in which one is formed by rearranging all of the letters of the other, using each exactly once, into new meaningful words or phrases. These models have previously been shown to perform well on text-based portions of general intelligence tests, successfully solving various word and mathematical puzzles in a manner that resembles elements of human general intelligence. Surprisingly, all three LLMs performed quite badly when prompted to generate anagrams related to a specific theme, succeeding in only 2.5% of anagram attempts overall, with only the GPT-4 version of ChatGPT producing any valid anagrams at all. This failure occurred even though all three LLMs, when queried, returned the correct definition of an anagram along with one or more valid examples. In summary, the failure of current LLMs to generate anagrams related to a specific theme provides a curious example of a “cognitive blind spot” in the performance of these otherwise impressive tools.
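The definition of an anagram given above can be checked mechanically. The following is a minimal sketch in Python and is not the evaluation procedure used in this study: the function names `letters` and `is_anagram` are hypothetical, and the sketch assumes that case, spaces, and punctuation are ignored. It verifies only the letter-level condition; whether the result is meaningful or related to a theme still requires human judgment.

```python
from collections import Counter


def letters(text: str) -> Counter:
    """Count the letters in text, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def is_anagram(source: str, candidate: str) -> bool:
    """Return True if candidate uses every letter of source exactly once.

    Assumption: a candidate identical to the source (ignoring case and
    spacing) is rejected, since an anagram must be a rearrangement into
    something new. Meaningfulness of the words is NOT checked here.
    """
    same_letters = letters(source) == letters(candidate)
    same_words = source.lower().split() == candidate.lower().split()
    return same_letters and not same_words


print(is_anagram("listen", "silent"))         # same letters, rearranged
print(is_anagram("Dormitory", "dirty room"))  # phrases are allowed
print(is_anagram("planet", "plane"))          # one letter is left unused
```

A check of this kind makes the failure mode concrete: a proposed anagram that drops, adds, or repeats even one letter fails the comparison of letter counts, regardless of how plausible the output looks.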