Wired: Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

Source URL: https://arstechnica.com/ai/2024/10/llms-cant-perform-genuine-logical-reasoning-apple-researchers-suggest/
Source: Wired
Title: Apple Engineers Show How Flimsy AI ‘Reasoning’ Can Be

Feedly Summary: The new frontier in large language models is the ability to “reason” their way through problems. New research from Apple says it’s not quite what it’s cracked up to be.

AI Summary and Description: Yes

Summary: The study by Apple engineers reveals critical weaknesses in the mathematical reasoning abilities of advanced Large Language Models (LLMs), including those from OpenAI and Google. It finds that these models rely on probabilistic pattern matching rather than genuine logical reasoning, leading to significant performance drops when benchmark problems are modified in even minor ways.

Detailed Description:
The recent study, “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” by a team of Apple engineers exposes fundamental shortcomings in the reasoning capabilities of state-of-the-art LLMs. The researchers show that small changes to a problem’s wording can dramatically degrade a model’s accuracy, reinforcing the notion that current LLMs do not perform formal logical reasoning.

Key Insights:
– **Reasoning Limitations**: The LLMs’ reliance on probabilistic pattern matching makes their mathematical reasoning brittle, indicating that these models are not performing authentic logical reasoning.

– **Benchmark Variations**: The researchers started from GSM8K, a well-known benchmark for evaluating mathematical reasoning in LLMs, and built a modified dataset called GSM-Symbolic that keeps the same mathematical problems but swaps out names and numbers to avoid “data contamination.” This alone produced notable decreases in accuracy, implying that the models’ outputs are heavily influenced by surface details of the prompt rather than the underlying mathematical logic (a sketch of this kind of templated substitution appears after this list).

– **Performance Variability**: Across the more than 20 LLMs tested, accuracy on GSM-Symbolic dropped by between 0.3% and 9.2% relative to GSM8K, and some models showed swings of up to 15% in accuracy between runs even when the question structure stayed the same.

– **Introducing Irrelevant Information**: An additional benchmark, “GSM-NoOp,” inserted irrelevant details into problems, producing catastrophic accuracy drops ranging from 17.5% to 65.7%. This stark contrast suggests that LLMs struggle to distinguish pertinent information from noise, further evidencing the limits of their pattern-matching approach (the sketch below also illustrates this kind of distractor clause).

– **Implications for AI Development**: The findings raise critical questions about how AI researchers and developers should train and evaluate systems for logical reasoning. Focusing solely on improving benchmark accuracy without addressing the models’ foundational reasoning processes could lead to unreliable AI applications, particularly in fields that require sound decision-making.
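
To make the benchmark-construction idea concrete, here is a minimal, hypothetical Python sketch of the two manipulations described above: re-sampling the names and numbers of a GSM8K-style word problem (in the spirit of GSM-Symbolic) and appending a numerically plausible but irrelevant clause (in the spirit of GSM-NoOp). The template, names, and distractor text are illustrative assumptions, not material from the paper.

```python
import random

# Hypothetical GSM8K-style template (illustrative, not from the paper):
# names and numbers are placeholders that get re-sampled for each variant.
TEMPLATE = (
    "{name} picks {n1} apples on Monday and {n2} apples on Tuesday. "
    "{name} then gives away {n3} apples. How many apples does {name} have left?"
)

# GSM-NoOp-style distractor: mentions a number but does not affect the answer.
NOOP_CLAUSE = " Note that {n4} of the apples were slightly smaller than average."

NAMES = ["Sophie", "Liam", "Ava", "Noah"]  # placeholder names


def make_variant(with_noop=False, seed=None):
    """Return a (question, answer) pair with freshly sampled names and numbers."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    n1, n2 = rng.randint(10, 50), rng.randint(10, 50)
    n3 = rng.randint(1, n1 + n2)   # keep the ground-truth answer non-negative
    answer = n1 + n2 - n3          # the underlying arithmetic never changes

    question = TEMPLATE.format(name=name, n1=n1, n2=n2, n3=n3)
    if with_noop:
        question += NOOP_CLAUSE.format(n4=rng.randint(1, 5))
    return question, answer


if __name__ == "__main__":
    q, a = make_variant(with_noop=True, seed=0)
    print(q)
    print("Expected answer:", a)
```

Because the ground-truth arithmetic is computed from the sampled numbers, any accuracy drop on such variants points to sensitivity to surface wording rather than to harder math, which is the effect the Apple team reports.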

In conclusion, the study underscores the need for development methodologies that produce AI systems which not only excel at pattern recognition but also grasp the concepts required for robust reasoning, a prerequisite for reliable use in practical applications across industries.