Administration of the text-based portions of a general IQ test to five
different large language models
Abstract
As additional large language model (LLM) AI chatbots become publicly
available, there is growing interest in their capacity for general
intelligence and in the differences in intelligence these models might
exhibit. One challenge in assessing general intelligence using a
standard intelligence quotient (IQ) test is that a large fraction of the
questions in such tests are visual, in particular the “spatial”
portions that present patterns and sequences in drawn images, and the
numerical questions in which the spatial arrangement of numbers is
important. In this study, the author distilled the text-based
portions of two self-scoring IQ tests and administered these questions
to five different publicly available large language models: ChatGPT
(Default GPT-3.5 version), ChatGPT (Legacy GPT-3.5 version), ChatGPT
(GPT-4 version), the Microsoft Bing chatbot (also based on the GPT-4
LLM, but linked to live internet search), and Google Bard, which is
based on the LaMDA LLM. The test scores were converted into a range of
approximate IQ values for each LLM, with the following median values:
112, 111.5, 123, 121.5, and 101, respectively. Of particular
interest is that all five LLMs performed exceptionally well on certain
question types and markedly poorly on others, suggesting that LLMs share
common strengths and weaknesses in specific aspects of general
intelligence. The highest-performing LLM publicly
available to date, the GPT-4 version of ChatGPT Plus, shows performance
on the text-based portions of a general IQ test that approaches the 99th
percentile of human performance, within the range of Mensa-level general
intelligence. Based on the differences seen across versions released in
the past year, these models are expected to continue improving over time
and will soon be capable of taking IQ tests in their entirety, including
the portions that rely on interpretation of graphical images.