An important paper from a team of Apple researchers was published in the last few days, and I believe it could impact the current discussions about artificial intelligence:
The paper, entitled GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, examines in depth how capable various language models are at reasoning tasks, trying to answer one question:
Are LLMs really capable of reasoning?
The latest LLMs that feature chain-of-thought reasoning (such as OpenAI o1 and Claude 3.5 Sonnet), also known as reasoning models, are genuinely impressive, and we are using them for increasingly complex tasks, even letting them get very close to decision-making.
Since almost all of these models come from the tool vendors themselves, it is natural to be somewhat skeptical of the metrics they publish. If we also factor in the hype around the products (especially OpenAI's o1), it becomes difficult to discern what is purely marketing and what is actually real about these models.
The Apple research team proposes a benchmark that generates question variants to better assess how well LLMs can solve problems. In essence, they found a large variance in results when the same question is asked in different ways.
In particular, the performance of all models declines when only the numerical values are changed in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that performance deteriorates significantly as the number of clauses in the question increases. We hypothesize that this decline is due to the fact that LLMs are not capable of genuine logical reasoning; instead, these models try to replicate the reasoning steps present in their training data.
How does the GSM-Symbolic assessment work? Can LLMs reason?
In general, the models' capabilities are assessed by writing out mathematical problem statements and interpreting the answers the models produce.
Below, you can see an example from the GSM8K benchmark dataset and its evaluation strategy on the left side.
On the right side is the new template for generating statements, which builds on GSM8K and allows far more variation, and consequently broader scope in the researchers' analysis.
Template for generating the benchmark dataset
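To make the idea concrete, here is a minimal sketch of what such a symbolic template might look like. This is illustrative only, not the paper's actual code: the template text, the list of names, and the sampling ranges are all my own assumptions about the approach.

```python
import random

# Illustrative GSM-Symbolic-style template (not the paper's actual code):
# names and numbers become placeholders, and each draw produces a new question
# variant whose correct answer is computed directly from the sampled values.

TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives away {z} apples. How many apples does {name} have left?"
)

NAMES = ["Sophie", "Liam", "Mary", "Joe"]

def generate_variant(seed: int):
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    # Sample values under simple constraints so the answer stays a non-negative integer.
    x, y = rng.randint(5, 50), rng.randint(5, 50)
    z = rng.randint(1, x + y)
    question = TEMPLATE.format(name=name, x=x, y=y, z=z)
    answer = x + y - z  # ground truth follows directly from the sampled values
    return question, answer

if __name__ == "__main__":
    for seed in range(3):
        q, a = generate_variant(seed)
        print(q, "->", a)
```

Because the correct answer is derived from the sampled variables, the benchmark can produce thousands of equivalent questions whose surface form differs while the underlying reasoning stays the same.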
All models break with simple changes to the problem, fail to filter out irrelevant information, and are far less capable of reaching conclusions that do not already exist on the web or in their training dataset. This is the most important thing to understand about the article's conclusions:
LLMs are heavily influenced by small variations in the questions and are not reliable tools for critical answers. If you currently rely on your GPT instance to make decisions, I suggest you rethink that decision. Simply changing the names in a prompt can shift the results by around 10%!
Small changes, such as the names and values in a prompt, can completely change the output, indicating that these models are driven by token-level pattern matching rather than formal reasoning, which has been said for a long time: LLMs do probabilistic pattern matching. If similar information exists in their training data, they respond by joining the dots, retrieving the closest pattern they have seen, without any "understanding" of the underlying concepts.
Basic changes to questions drawn from the training dataset produce striking drops in performance, and adding irrelevant information to the prompt breaks every model tested. A quick way to check this for yourself is sketched below.
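Here is a rough sketch of how you could measure this on your own prompts. The `ask_model` function is a placeholder, not a real API; plug in whichever LLM client you use. The idea is simply to feed in variants of the same question (for example, those produced by `generate_variant()` above) and see how concentrated the answers are.

```python
from collections import Counter

# Hedged sketch: measure how stable a model's answer is across surface-level
# variants of the same problem. `ask_model` is a hypothetical placeholder for
# whatever LLM client you actually use; it must be supplied by you.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your own LLM client here")

def answer_stability(variants: list[str]) -> Counter:
    """Return how often each distinct answer appears across the variants."""
    answers = [ask_model(v).strip() for v in variants]
    return Counter(answers)

# Usage idea: if the model were genuinely reasoning, the counter would be
# concentrated on the single correct answer; the paper reports noticeable
# spread when only names and numbers change.
```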
But how does this happen?
The main reasoning benchmark today is the GSM8K dataset, which includes over 8,000 grade-school math word problems whose solutions require the four basic arithmetic operations.
Although the benchmark is quite relevant, the hypothesis is that, because the dataset is public, it may have contaminated the models' training data and therefore the results, which is why any simple change to the prompt can significantly alter the outcome. The model does not "know" how to solve your prompt; it already has the correct answer recorded.
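You can inspect the public dataset yourself, which makes the contamination concern easy to appreciate. A minimal sketch, assuming the Hugging Face datasets library and the public gsm8k dataset id:

```python
from datasets import load_dataset  # pip install datasets

# Sketch: look at the public GSM8K test split. Because these exact questions
# are freely available online, they may have leaked into model training data,
# which is the contamination concern discussed above.
gsm8k = load_dataset("gsm8k", "main", split="test")

example = gsm8k[0]
print(example["question"])
# The reference solution ends with the final answer after a "####" marker.
print(example["answer"].split("####")[-1].strip())
```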
In the end, LLMs seem like magic, but only because the models have memorized THE ENTIRE Internet and are able to search that content for something similar to what you are looking for, and you may be putting far more trust in the tool than it deserves.
What I found most interesting about the article is that all LLMs are extremely fragile when the prompt incorporates irrelevant information, which negatively impacts the result. Even the most recent models, considered to be one step away from [[AGI]], make the same mistakes:
Sentences that are irrelevant to the reasoning of the problem are added and drastically affect the conclusion; most models fail to recognize that these sentences are irrelevant.
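To illustrate the kind of perturbation involved, here is a small sketch in the spirit of the paper's GSM-NoOp variants. The wording is paraphrased, not taken verbatim from the paper: a numerically irrelevant clause is appended, and the correct answer does not change, yet models frequently try to "use" the extra number.

```python
# Sketch of a GSM-NoOp-style perturbation: append a clause that is numerically
# irrelevant to the problem. Paraphrased example, not the paper's exact text.

base_question = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
    "On Sunday he picks double the number he picked on Friday. "
    "How many kiwis does Oliver have?"
)

noop_clause = "Five of the kiwis picked on Sunday were a bit smaller than average."

noop_question = base_question.replace(
    "How many kiwis does Oliver have?",
    noop_clause + " How many kiwis does Oliver have?",
)

correct_answer = 44 + 58 + 2 * 44  # 190 either way: the extra clause changes nothing
print(noop_question)
print("Correct answer:", correct_answer)
```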
How can this help you?
Stating the obvious:
Do not entrust risky actions to ChatGPT.
And more than that: any plan your company may have to use these models as decision-makers increases the risk to the business.
Whether you’re writing code, picking stocks to buy, or planning your strategy, these models are not reliable, despite being very useful!
Performance changes completely when proper nouns and values are changed in math-problem prompts (make it Joe with 5 lemons instead of Mary with 4 oranges and the solution is completely affected).
Anything new that needs to be produced can be supported by LLMs, but never directed by them.
Now, this paper holds a very relevant lesson for those who use LLMs for work every day: distinguishing between reasoning and copying will be an important skill in the future, including for evaluating whether an idea from Claude is really viable or not. In the end, the article is very useful for improving your prompting skills, since I believe it is inevitable that LLMs will be incorporated into all knowledge work.
For a more in-depth explanation, I suggest this talk: