When AI thinks before it speaks

Marina Pantcheva, Director of Linguistic AI Services · 04 Feb 2025 · 7 min read

According to my parents, as a child, I would ask hundreds of questions a day. I’d go on question-generating sprees, firing off a barrage of ‘What for?’, ‘Where from?’, ‘How come?’, and ‘Who else?’ But whenever I asked ‘Why?’, my father, a scientist, would stop me and clarify: did I mean a ‘mechanistic why’ or a ‘teleological why’? 

The first ‘why’ is scientific: it aims to uncover the causes and mechanisms behind a phenomenon. The second ‘why’ is philosophical: it explores a phenomenon’s purpose and ultimate meaning. As an aspiring little ‘want-to-be-an-astrophysicist’, I chose to always ask the first type of ‘why.’ 

Fast forward to January 2025, and I found myself once again asking my favourite mechanistic ‘Why?’ a hundred times a day. But this time, my questions weren’t targeted at my father. Instead, I was asking AI.

The rise of reasoning AI

The release of DeepSeek’s R1 model last week set off the race to improve reasoning AI models. Amidst the geopolitical storm, the stock value crash, and the legal debates over copyright, one important aspect faded into the background: as reasoning models become ubiquitous, we can finally ask ‘Why do AI models respond the way they do?’ to try and understand the reasoning process behind their answers. 

This realization came to me during an exchange with Renato Beninatto on LinkedIn. Renato tested DeepSeek’s R1 and OpenAI’s GPT-o1 on a tricky translation. Of the two models, only GPT-o1 managed to resolve the translation issue. The first question that came to my mind was: Why did GPT-o1 get it right? Was it just luck, or did it genuinely reason its way to the correct translation?

The power of ‘Why?’

This shift is monumental. Older AI models performed translation tasks using their implicit knowledge of linguistic structures, relying on vast amounts of pre-trained data to predict the next most likely word. They couldn’t articulate why they made certain choices. 

Now, we can ask AI models not just to translate but to explain their translation decisions. We can look under the hood and gain insights into their reasoning process: whether they arrived at the correct translation through logic or through sheer luck and whether their mistakes stemmed from flawed reasoning or from a lack of knowledge.

From stochastic parroting to logical thinking

Older AI models functioned like intuitive System 1 thinkers (Kahneman, Daniel. 2011. Thinking, Fast and Slow). System 1 thinking is fast, automatic and effortless. It allows humans to instantly recognize a face, complete a common phrase, or react to a sudden sound. The older, non-reasoning AI models (like GPT-3) processed prompts in this manner: without pausing to think, simply producing responses based on statistical patterns rather than deliberate logical analysis. 

Reasoning AI models like DeepSeek’s R1, Google’s Gemini 1.5 Pro and OpenAI’s GPT-o1 operate more like System 2 thinkers. System 2 thinking is slow, deliberate and logical. It activates when solving a math problem, planning a strategy, or analyzing complex scenarios. Reasoning AI models don’t just predict the most likely next word based on probability. They actively reason through problems, weigh different interpretations and explain their conclusions. This shift is akin to a person pausing to think critically before answering a question rather than relying on intuition alone.

How does reasoning emerge in AI?

Instead of activating all artificial neurons in their neural network simultaneously, reasoning AI models use a Mixture of Experts (MoE) approach. This means that they delegate specific tasks to specialized subnetworks, much like the human brain uses different brain regions to perform different functions.

For example, during a conversation, the human brain employs a division of labour among various brain regions: 

  • Wernicke’s area handles language comprehension (understanding the partner’s question). 
  • The temporoparietal junction assesses intent (why did my partner ask this? What information do they need?). 
  • The amygdala processes emotions (e.g., recognizing that the partner needs help). 
  • The prefrontal cortex engages in logical reasoning and constructs a meaningful response. 
  • Broca’s area is responsible for producing the answer as a coherent and well-formed linguistic sequence. 

Additionally, numerous other brain regions activate to support auditory perception, vision, articulation, memory and other functions. Similarly, MoE-based AI models selectively activate expert subnetworks, optimizing reasoning and decision-making processes, effectively mimicking the reflective thinking process of humans.
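
To make the routing idea concrete, here is a minimal sketch of how an MoE layer sends an input to a small number of expert subnetworks. It is a toy illustration only: the dimensions, the two-expert routing and all variable names are invented for the example and do not reflect the internals of any model discussed here.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 16, 8, 2  # toy sizes, not real model dimensions

# Each "expert" is a tiny feed-forward subnetwork with its own weights.
expert_weights = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
# The router scores how relevant each expert is for a given token representation.
router_weights = rng.standard_normal((d_model, n_experts)) * 0.1


def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route one token vector to its top-k experts and mix their outputs."""
    scores = token @ router_weights            # one relevance score per expert
    top = np.argsort(scores)[-top_k:]          # indices of the k best-scoring experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                       # softmax over the selected experts only
    # Only the selected experts run; the remaining subnetworks stay inactive for this token.
    return sum(g * np.maximum(token @ expert_weights[i], 0.0) for g, i in zip(gates, top))


token = rng.standard_normal(d_model)
print(moe_layer(token).shape)  # (16,) – same shape as the input, produced by 2 of the 8 experts
```

In production models the router runs for every token at every MoE layer and the experts number in the dozens or hundreds, but the principle is the same: only a fraction of the network is active for any given input.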

Putting AI to the test: The intuitive vs. the deliberative translator

The transition from probabilistic AI models to reasoning AI models is like switching from a translator who writes down the first thing that comes to mind to one who takes the time to analyze the source text, examine the context, consult subject matter experts and only then produce a translation. 

Inspired by my discussions with Renato Beninatto (Nimdzi Insights) and Vincent Gadani (Microsoft), I decided to test various AI models using a tricky translation of my own. I chose the segment ‘empty folder’ for translation into German. It is a classic example of an ambiguous segment that requires contextual understanding to be translated accurately. As a UI element, its correct translation is ‘Ordner leeren’ (a noun followed by a verb in the infinitive). When handling such a segment, a good translator must take the following factors into consideration: 

Syntactic ambiguity 

The phrase could be interpreted as: 

  • a descriptive phrase (adjective + noun): a folder that is empty 
  • a call to action (verb + noun): perform the action of emptying the folder 

Semantic ambiguity 

  • The adjective ‘empty’ is polysemous. It can denote: 
    • a folder that has no content 
    • a folder that has been emptied (Some languages make an explicit distinction between these two meanings.) 
  • The noun ‘folder’ is ambiguous between: 
    • a physical folder (e.g., a binder) 
    • a digital file directory (e.g., a computer folder)

Naturally, the translation must also adhere to the established format of UI segments, as prescribed by the applicable style guide and general practice.

Tested models

I prompted six models to translate the segment. The selection includes two reasoning models (DeepSeek R1 and OpenAI GPT-o1) and their corresponding non-reasoning counterparts (DeepSeek V3 and OpenAI GPT-4o). This setup allowed for an assessment of how Mixture-of-Experts (MoE) architectures influence reasoning capabilities in translation tasks. 

Given that DeepSeek's models may have been partially developed through the distillation of OpenAI's models, I also included Google’s two reasoning models in the test group to assess the performance of independently created MoE models. 

  1. DeepSeek V3: a non-reasoning instruction-tuned model, positioned as a competitor to OpenAI’s GPT-4o. It follows a probabilistic approach, generating translations based on statistical likelihood rather than explicit logical reasoning. 
  2. DeepSeek R1: a reasoning AI model that made headlines last week. Built on top of the V3 base model, it is said to perform on par with OpenAI’s GPT-o1, excelling in structured reasoning tasks. Unlike V3, it can engage in deliberate problem-solving rather than simply predicting the next most probable token. 
  3. GPT-4o: OpenAI’s flagship multimodal model, capable of processing text, vision and audio. It is known for its fast inference speed and improved contextual understanding, but it remains a generalist AI, meaning it may or may not apply structured reasoning to translation tasks without explicit prompting. 
  4. OpenAI GPT-o1: OpenAI’s first reasoning AI model, designed to spend more time thinking before responding. Unlike previous probabilistic models, GPT-o1 aims to follow a logical chain of thought, making it an ideal candidate for handling ambiguity in translation. 
  5. Google Gemini 1.5 Flash: A Mixture-of-Experts (MoE) model optimized for efficiency and speed. While Flash models are designed for quick, lightweight tasks, they might lack the deep reasoning ability of their Pro counterparts. 
  6. Google Gemini 1.5 Pro: Google’s most powerful MoE-based reasoning model. Unlike Flash, it is designed for complex, multi-step tasks, making it a strong competitor to GPT-o1 and DeepSeek R1 in handling ambiguous translations. 

Importantly, R1 reveals its thinking steps (which make for fascinating reading!), while GPT-o1 and Gemini 1.5 Pro provide only a summarized version of the thinking process.

Testing prompts

I used two prompts in the same chat session. The first prompt tested whether the model considers all possible interpretations of the segment, including semantic and syntactic ambiguity. The desired result was that the model would attribute a ‘call-to-action’ parsing (verb + noun) to the segment. In that case, I also checked whether the translation followed the widely accepted German UI translation standard, rendering it as a noun followed by the verb’s infinitive form. 

The second prompt explicitly instructed the model to translate the segment as a UI element.
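
For readers who want to try a comparison like this themselves, here is a minimal sketch of the two-prompt setup, assuming access to an OpenAI-compatible chat API. The model identifiers are placeholders (each provider exposes its own names and endpoints), and this is not the exact script used for the experiment.

```python
# Minimal sketch of the two-prompt test, assuming an OpenAI-compatible chat API.
# Model names below are placeholders; substitute the identifiers your provider exposes.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

PROMPT_1 = ("Translate into German 'empty folder'. Consider all possible interpretations "
            "of the phrase caused by semantic or syntactic ambiguity.")
PROMPT_2 = "Translate 'empty folder' as a name of a UI button."


def run_two_prompt_test(model: str) -> tuple[str, str]:
    """Send both prompts within one chat session and return the two replies."""
    messages = [{"role": "user", "content": PROMPT_1}]
    first = client.chat.completions.create(model=model, messages=messages)
    first_reply = first.choices[0].message.content

    # Keep the first exchange in the context so the follow-up prompt lands in the
    # same conversation, mirroring the setup described above.
    messages += [
        {"role": "assistant", "content": first_reply},
        {"role": "user", "content": PROMPT_2},
    ]
    second = client.chat.completions.create(model=model, messages=messages)
    return first_reply, second.choices[0].message.content


if __name__ == "__main__":
    for model_name in ["gpt-4o", "o1"]:  # placeholder identifiers
        ambiguity_reply, ui_reply = run_two_prompt_test(model_name)
        print(f"=== {model_name} ===\n{ambiguity_reply}\n---\n{ui_reply}\n")
```

Models from other providers would be queried through their own SDKs or OpenAI-compatible endpoints, using the same two-turn conversation structure.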

The results

The results from each model and prompt are summarized in the table below.


First prompt: Translate into German ‘empty folder’. Consider all possible interpretations of the phrase caused by semantic or syntactic ambiguity.

| Criterion | DeepSeek V3 | DeepSeek R1 | OpenAI GPT-4o | OpenAI GPT-o1 | Gemini 2.0 Flash | Gemini 1.5 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Polysemy of ‘empty’ considered | yes | yes | no | yes | yes | yes |
| Ambiguity of ‘folder’ considered | yes | yes | no | yes | no | no |
| Syntactic ambiguity considered | no | no | yes | yes | no | no |
| Adheres to standards for translating UI elements | * | * | no | yes | * | * |

Follow-up prompt: Translate ‘empty folder’ as a name of a UI button.

| Criterion | DeepSeek V3 | DeepSeek R1 | OpenAI GPT-4o | OpenAI GPT-o1 | Gemini 2.0 Flash | Gemini 1.5 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Correctly identifies the phrase as a call to action (verb + noun) | yes | no | yes | yes | yes | yes |
| Adheres to standards for translating UI elements | yes | no | yes | yes | yes | yes |


OpenAI’s GPT-o1 was the only model to capture all possible nuances correctly. This doesn’t mean it’s the best model available, only that it handled this specific example best (see the footnote at the end of this article). Its reasoning was focused and concise, reaching the correct translation in just a few seconds.

Interestingly, the base model it builds upon, GPT-4o, detected only the syntactic ambiguity and missed the semantic ambiguities. However, both models correctly translated the segment when explicitly told it was a UI element. 

Gemini’s models, surprisingly, showed little sensitivity to ambiguity. They detected the polysemy of the adjective ‘empty’ but failed to recognize the different types of folders (digital vs. physical) or the syntactic ambiguity. When asked to translate the segment as a UI element, both models provided the standard German translation: a noun followed by an infinitive. 

DeepSeek’s models identified the lexical ambiguity of ‘empty’ and ‘folder’ but missed the syntactic ambiguity. The most interesting behavior emerged when they were explicitly prompted to translate the segment as a UI element. The base instruction-tuned model, V3, produced the correct translation. However, the reasoning model, R1, initially considered a call-to-action parsing before rejecting the noun + verb translation. Misled by the prompt’s wording (stating that the segment was the name of a UI element), R1 insisted on a strictly nominal translation. 

I’ve observed in other experiments, too, that DeepSeek R1 tends to assume that the user prompt is both factually accurate (which is not always the case) and logically coherent (which is even less common, given that humans are famous for their ability to hold multiple contradictory beliefs at once). These assumptions can cause the model to enter prolonged reasoning loops, repeatedly cycling through the same logical patterns for minutes before it resolves the contradiction between its conclusions and the prompt. This negatively impacts latency, as the model spends significantly more time on a single segment than a human translator ever would.

From alchemy to chemistry: A new era for prompt engineering

Observing how reasoning models ‘think’ is fascinating not just as a curiosity but for its practical implications. It provides insight into their thought processes, including incorrect interpretations of prompts and leaps in logic caused by poorly defined user requirements, misunderstandings and missing context. 

This information is invaluable for prompt engineers. They can use it to refine prompts and improve the models’ reasoning. By pinpointing where models go wrong, prompt engineers can strategically guide AI toward the correct reasoning path. 
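
As a hypothetical illustration, the R1 behaviour described above suggests one such refinement: since the wording ‘a name of a UI button’ pushed R1 toward a purely nominal reading, a revised prompt can spell out the intended parse. The refined wording below is my own example, not a prompt used in the test.

```python
# The original prompt is the follow-up prompt from the test above; the refined
# version is a hypothetical rewrite that pre-empts the nominal misreading
# observed in R1's reasoning trace.
ORIGINAL_PROMPT = "Translate 'empty folder' as a name of a UI button."

REFINED_PROMPT = (
    "Translate 'empty folder' as the label of a UI button. "
    "The label is a call to action: clicking the button empties the folder. "
    "Follow the German UI convention of a noun followed by an infinitive verb."
)
```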

With reasoning models, then, prompt engineering is finally evolving from alchemy to chemistry.


Footnote: This test is not meant to be a definitive assessment of the models' overall performance. It is merely a toy example designed to offer a glimpse into how different AI models handle reasoning. No large-scale conclusions should be drawn from this experiment. For meaningful insights, the test must be performed using rigorous research methods, controlled prompts and extensive test data sets.
Marina Pantcheva
Director of Linguistic AI Services
Marina is Director of Linguistic AI Services at RWS. Marina holds a PhD degree in Theoretical Linguistics and is passionate about languages, data, technology and exploring new areas of knowledge. She leads a high-performing team of linguistic AI professionals and researchers who develop linguistic AI solutions as part of RWS' tech and service offering.