Administration of the text-based portions of a general IQ test to five
different large language models
Abstract
As additional large language model (LLM) AI chatbots become publicly
available, there is growing interest in their capacity for general
intelligence and in the differences in intelligence these models might
exhibit. One challenge in assessing general intelligence using a
standard intelligence quotient (IQ) test is that a large fraction of the
questions in such tests are visual, in particular the “spatial”
portions that present patterns and sequences in drawn images, and the
numerical questions in which the spatial arrangement of numbers is
important. In this study, the author distilled the text-based
portions of two self-scoring IQ tests and administered these questions
to five different publicly available large language models: ChatGPT
(Default GPT-3.5 version), ChatGPT (Legacy GPT-3.5 version), ChatGPT
(GPT-4 version), the Microsoft Bing chatbot (also based on the GPT-4
LLM, but linked to live internet search), and Google Bard, which is
based on the LaMDA LLM. The test scores were converted into a range of
approximate IQ values for each LLM, with the following median values:
112, 111.5, 123, 121.5, and 101, respectively. Of particular
interest is that all five LLMs performed exceptionally well on certain
question types and markedly poorly on others, suggesting that LLMs share
common strengths and weaknesses in specific aspects of general
intelligence. The highest-performing LLM publicly
available to date, the GPT-4 version of ChatGPT Plus, shows performance
on the text-based portions of a general IQ test that approaches the 99th
percentile of human performance, within the range of Mensa-level general
intelligence. Based on the differences seen across versions released in
the past year, these models are expected to continue improving over time
and will soon be capable of taking IQ tests in their entirety, including
the portions that rely on interpretation of graphical images.