The Cognitive Transition in Large Language Models: From Probabilistic Pattern Matching to Deliberative System 2 Reasoning

1. Introduction: The Reasoning Frontier

The trajectory of Large Language Model (LLM) development has shifted decisively from the pursuit of parameter scale (“Pre-training Scaling Laws”) to the optimization of reasoning capabilities through computational depth (“Inference-Time Scaling Laws”). Historically, LLMs operated on a paradigm of autoregressive next-token prediction, a mechanism frequently analogized to “System 1” cognition in humans—fast, intuitive, and heuristic-driven. While this architecture yielded unprecedented capabilities in fluency and knowledge retrieval, it exhibited fundamental fragility in complex problem-solving tasks requiring multi-step logic, backtracking, and error correction.1

The current research landscape, spanning 2024 and 2025, is defined by the quest to engineer “System 2” capabilities: slow, deliberative, and logical reasoning processes. This transition is not merely an incremental improvement but a restructuring of how models approach computation. It involves three distinct but converging vectors of innovation: Advanced Structured Prompting, which imposes topological constraints on the model’s output; Inference-Time Compute Scaling, which treats reasoning as a search problem over a latent state space; and Architectural Modifications, which integrate persistent memory, recurrence, and neuro-symbolic logic directly into the model’s substrate.4

The implications of this shift are profound. We are moving from a “token economy,” where utility is measured by output speed and length, to a “compute economy,” where utility is a function of the energy and time invested in finding the correct answer. This report provides an exhaustive technical analysis of these methodologies, synthesizing empirical data from recent breakthroughs like the DeepSeek-R1 training pipeline, the Q* heuristic search framework, and the emergence of “thinking tokens” as information-theoretic control nodes.

2. Structured Prompting and Topological Reasoning Frameworks

The initial mechanism for eliciting reasoning from LLMs was the “Chain-of-Thought” (CoT) prompt, which demonstrated that generating intermediate steps could unlock emergent capabilities. However, the linearity of standard CoT—proceeding sequentially from premise to conclusion—has proven insufficient for complex tasks that require exploring multiple hypotheses or synthesizing non-linear information. The field has thus advanced toward “Topological Reasoning,” where the structure of the prompt mirrors the geometry of the problem space.1

2.1 Beyond Linearity: Tree of Thoughts (ToT)

The Tree of Thoughts (ToT) framework represents the first formal departure from linear reasoning. ToT conceptualizes the reasoning process as a search over a tree structure, where each node represents a “thought”—a coherent intermediate step toward solving a problem.7 Unlike CoT, which commits to a single path, ToT enables the model to perform “lookahead” (simulating the consequences of a thought) and “backtracking” (abandoning a path that yields a low evaluation score).

In a typical ToT implementation, the LLM acts as both the “Generator” (proposing new child nodes) and the “Evaluator” (scoring nodes based on their promise). An external controller algorithm, such as Breadth-First Search (BFS) or Depth-First Search (DFS), manages the exploration of this tree. For instance, in the “Game of 24” benchmark, ToT allows the model to explore different arithmetic combinations, backtrack when a branch leads to an impossible state, and converge on a solution that a linear pass would likely miss.7
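
A minimal sketch of such a controller (breadth-first variant) is shown below. The `llm` callable, the prompt wording, and the scoring scheme are placeholders standing in for whatever generator/evaluator interface a given system exposes; this is an illustration of the pattern, not the original ToT implementation.

```python
import heapq
from typing import Callable, List, Tuple

def tot_bfs(llm: Callable[[str], str], question: str,
            breadth: int = 5, beam: int = 2, depth: int = 3) -> str:
    """Breadth-first Tree of Thoughts: the LLM both proposes and scores thoughts."""
    frontier: List[Tuple[float, str]] = [(0.0, "")]  # (score, partial reasoning)
    for _ in range(depth):
        candidates: List[Tuple[float, str]] = []
        for _, state in frontier:
            # Generator role: propose several candidate next thoughts for this node.
            proposals = llm(
                f"Question: {question}\nReasoning so far:\n{state}\n"
                f"Propose {breadth} distinct next steps, one per line."
            ).splitlines()[:breadth]
            for thought in proposals:
                new_state = state + thought + "\n"
                # Evaluator role: score how promising the partial chain is (0 to 1).
                score_text = llm(
                    f"Question: {question}\nPartial solution:\n{new_state}"
                    "Rate the chance this leads to a correct answer (0 to 1):"
                )
                try:
                    score = float(score_text.strip().split()[0])
                except (ValueError, IndexError):
                    score = 0.0
                candidates.append((score, new_state))
        # Keep only the `beam` most promising nodes (pruning / implicit backtracking).
        frontier = heapq.nlargest(beam, candidates, key=lambda c: c[0])
    return frontier[0][1] if frontier else ""
```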

However, the efficacy of ToT comes with a high latency cost. The requirement for an external controller to query the LLM repeatedly for node generation and evaluation creates a bottleneck. ToT is computationally intensive, often requiring hundreds of model calls to solve a single problem, making it impractical for real-time applications despite its high accuracy.1

2.2 Networked Reasoning: Graph of Thoughts (GoT)

While trees allow for branching, they enforce a strict hierarchy that prevents the synthesis of disparate ideas. The Graph of Thoughts (GoT) framework generalizes the topology further by modeling reasoning as a Directed Acyclic Graph (DAG) or even cyclic graphs.6 GoT introduces operations that are topologically impossible in ToT:

  • Aggregation: Fusing multiple independent thought chains into a unified node. This is critical for tasks like summarization or multi-document synthesis, where the model must combine insights from Branch A and Branch B.
  • Refinement Loops: Cycles within the graph where a specific node is iteratively improved until it meets a quality threshold.
  • Split-Merge Patterns: Breaking a complex problem into sub-problems (nodes), solving them in parallel branches, and merging the solutions.

Empirical evaluations demonstrate that GoT outperforms ToT in sorting tasks (quality increase of 62%) and creative writing, while reducing costs by over 31% due to more efficient path pruning.6 By modeling the “network effect” of thoughts, GoT aligns more closely with human brainstorming, where ideas are interconnected rather than strictly hierarchical.
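
The aggregation and refinement operations described above can be captured in a few lines. The sketch below assumes a generic `llm(prompt) -> str` callable and hypothetical prompt wording; it is meant only to illustrate the graph operations, not the reference GoT codebase.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Thought:
    text: str
    parents: List["Thought"] = field(default_factory=list)

def aggregate(llm: Callable[[str], str], branches: List[Thought]) -> Thought:
    """GoT 'aggregation': fuse several independent chains into one unified node."""
    merged = llm("Combine these partial solutions into one coherent answer:\n\n"
                 + "\n---\n".join(b.text for b in branches))
    return Thought(text=merged, parents=branches)

def refine(llm: Callable[[str], str], node: Thought, rounds: int = 2) -> Thought:
    """GoT 'refinement loop': iteratively improve a single node."""
    current = node
    for _ in range(rounds):
        improved = llm(f"Improve the following draft, fixing any errors:\n{current.text}")
        current = Thought(text=improved, parents=[current])
    return current
```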

2.3 Algorithmic Internalization: Algorithm of Thoughts (AoT)

Addressing the latency and cost issues of ToT and GoT, the Algorithm of Thoughts (AoT) framework proposes a radical shift: internalizing the search process. Instead of relying on an external controller to manage the search state, AoT prompts the LLM to simulate the search algorithm within a single context window.10

AoT utilizes the in-context learning capabilities of LLMs. By providing few-shot examples of a search trajectory (e.g., a text representation of a DFS traversal, including explicitly marked “dead ends” and “backtracking” steps), the model learns to generate the entire search process in one continuous output. This eliminates the overhead of multiple API calls. The model effectively becomes its own search engine, managing the “frontier” of unvisited nodes and the “history” of visited ones within its working memory.10

Comparative Analysis of Structured Prompting Frameworks

 

| Framework | Topology | Control Mechanism | Key Operations | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Chain-of-Thought (CoT) | Linear Chain | Autoregressive Next-Token | Next-Step | Low Latency, Simple | Error Propagation, No Backtracking 1 |
| Tree of Thoughts (ToT) | Tree (Hierarchical) | External (BFS/DFS) | Branch, Prune, Backtrack | High Accuracy, Exploration | High Latency, High Cost, Rigid Hierarchy 7 |
| Graph of Thoughts (GoT) | Graph (DAG/Cyclic) | External (Graph Operations) | Aggregate, Refine, Loop | Information Synthesis, Flexible | Complex Implementation, Context Load 6 |
| Algorithm of Thoughts (AoT) | Dynamic Path (Simulated) | Internal (In-Context) | Simulated Search, Recursive Logic | Token Efficiency, Single Call | Limited by Context Window, Working Memory 10 |

2.4 Parallelization and Efficiency: Skeleton and Forest of Thought

While ToT and GoT focus on maximizing accuracy through exhaustive search, the Skeleton of Thought (SoT) and Forest of Thought (FoT) frameworks target the efficiency-accuracy trade-off.9 SoT operates on a “Plan-then-Execute” paradigm. The model is first prompted to generate a concise “skeleton” or outline of the answer. Once this structural backbone is established, the expansion of each point is parallelized. This not only reduces end-to-end latency (as multiple sections can be generated simultaneously by different model instances) but also improves coherence by fixing the high-level structure before the details are filled in.8
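
A minimal sketch of the plan-then-execute pattern, assuming a thread-safe `llm` callable; a real system would batch these expansion calls against an inference server rather than use Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def skeleton_of_thought(llm: Callable[[str], str], question: str) -> str:
    """Skeleton of Thought: generate an outline first, then expand each point in parallel."""
    outline = llm(f"Give a short numbered outline (3-6 points) answering:\n{question}")
    points: List[str] = [line for line in outline.splitlines() if line.strip()]

    def expand(point: str) -> str:
        return llm(f"Question: {question}\n"
                   f"Expand this outline point into 2-3 sentences:\n{point}")

    # Once the skeleton is fixed, the points are independent and can be expanded concurrently.
    with ThreadPoolExecutor(max_workers=len(points) or 1) as pool:
        sections = list(pool.map(expand, points))
    return "\n\n".join(sections)
```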

Forest of Thought (FoT) combines the breadth of ToT with the parallelization of SoT. It initiates multiple reasoning trees in parallel (a “forest”), leveraging collective decision-making. By aggregating the outcomes of multiple trees, FoT mitigates the risk of a single tree converging on a suboptimal local maximum. This approach aligns with the “Ensemble of Reasoning” hypothesis: that diversity in the solution space is a critical component of robust reasoning.12

2.5 System 2 Alignment and the “Chain-of-Draft”

Recent research into System 2-aligned models highlights a fundamental tension: deep reasoning is verbose. Models trained or prompted to behave like “System 2” thinkers (deliberate, analytical) generate significantly longer outputs than “System 1” (intuitive) models.3 This verbosity improves performance on arithmetic and symbolic tasks but can be detrimental to simple commonsense queries, leading to “over-thinking” or “hallucinated complexity.”

To manage this, the Chain-of-Draft (CoD) technique encourages the model to generate a minimal, syntactically simplified “draft” of the reasoning process.13 Instead of full natural language sentences (“First, I will calculate the value of X…”), CoD prompts for a dense, code-like or abbreviated representation. This reduces the token count (and thus the cost) of the reasoning trace while preserving the logical benefits of intermediate steps. It represents a move toward “efficient System 2” thinking, optimizing the information density of the reasoning chain.
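
An illustrative (hypothetical) contrast between a full CoT step and its Chain-of-Draft equivalent, plus the kind of instruction used to elicit the terse style:

```python
# Standard CoT step vs. Chain-of-Draft step for the same sub-calculation.
cot_step = "First, I will calculate the total cost: 3 apples at $2 each gives 3 * 2 = $6."
cod_step = "3*2=6"  # same logical content, a fraction of the tokens

cod_instruction = (
    "Think step by step, but write each step as a minimal draft of at most five words "
    "or one short expression. Give the final answer after '####'."
)
```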

3. Inference-Time Compute Scaling: The Engine of Deliberation

As Pre-training Scaling Laws (increasing model parameters) approach diminishing returns, the field has pivoted to Inference-Time Compute Scaling (or Test-Time Scaling). This paradigm posits that the performance of a model is not fixed after training but can be dynamically scaled by allocating more computational resources during the inference phase.4

The theoretical underpinning of this shift is the realization that complex problems often require a search over a solution space that is too large to be traversed by a single “greedy” pass. By allowing the model to “think longer”—generating more candidates, verifying them, and refining them—we can unlock capabilities that are otherwise latent.

3.1 Probabilistic Inference: Particle Filtering and “Rollout Roulette”

One of the most mathematically rigorous approaches to inference scaling involves reframing generation as a probabilistic inference task. Standard decoding methods like Beam Search optimize for the mode of the distribution (the single most likely sequence). However, in reasoning tasks, the “most likely” next token is often a generic or safe continuation, not necessarily the one that leads to the correct solution.

Particle Filtering (PF), applied to LLMs (often termed “Rollout Roulette” in 2025 literature), aims to explore the typical set of the distribution.16 The LLM is treated as a State Space Model (SSM), where the “state” is the partial reasoning trace.

  • Initialization: The filter starts with a set of $N$ particles (reasoning chains).
  • Propagation: At each step, new tokens are sampled for each particle using the LLM’s transition probabilities.
  • Rewarding: A Process Reward Model (PRM) assigns a weight to each particle, estimating the probability that this partial chain leads to a correct answer.
  • Resampling: Particles are resampled based on these weights. High-potential chains are duplicated (split), while low-potential ones are pruned.

This method allows the inference process to maintain a diverse population of hypotheses. Crucially, it prevents the “collapse” seen in Beam Search, where the beam fills up with variations of a single, slightly suboptimal path. Empirical results on the MATH500 benchmark show that Particle Filtering methods achieve a 4-16x better scaling rate than deterministic search counterparts.17 This suggests that for hard reasoning problems, maintaining diversity (the “typical set”) is more valuable than maximizing local probability (the “mode”).
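
A compact sketch of this loop, with `propose_step` (one sampled continuation from the LLM) and `prm_score` (the process reward model) as placeholder callables; production implementations typically use systematic resampling and batched scoring rather than this naive version.

```python
import random
from typing import Callable, List

def particle_filter(propose_step: Callable[[str], str],
                    prm_score: Callable[[str], float],
                    question: str, n_particles: int = 8, n_steps: int = 6) -> str:
    """Inference-time particle filtering over partial reasoning traces."""
    particles: List[str] = [question] * n_particles
    for _ in range(n_steps):
        # Propagation: sample one continuation for every particle.
        particles = [p + "\n" + propose_step(p) for p in particles]
        # Rewarding: weight each partial chain with the PRM.
        weights = [max(prm_score(p), 1e-9) for p in particles]
        total = sum(weights)
        probs = [w / total for w in weights]
        # Resampling: promising chains are duplicated, weak ones are dropped.
        particles = random.choices(particles, weights=probs, k=n_particles)
    return max(particles, key=prm_score)
```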

3.2 Heuristic Search: The Q* Framework

Parallel to probabilistic methods are approaches rooted in heuristic search, most notably the Q* (Q-Star) framework.19 Q* formalizes multi-step reasoning as a Markov Decision Process (MDP):

  • State ($s_t$): The current context (question + reasoning so far).
  • Action ($a_t$): The next reasoning step or thought.
  • Reward ($R$): The binary correctness of the final answer (often delayed).

The core innovation of Q* is the training of a Q-value function (or Q-value Model) that estimates the expected future reward of a current state-action pair: $Q(s_t, a_t)$. This Q-value serves as an admissible heuristic for an A* search algorithm. Unlike Monte Carlo Tree Search (MCTS), which requires expensive “rollouts” (simulating the game to the end) to estimate the value of a node, a trained Q-function provides an immediate, low-cost estimate.19

This transforms the decoding process from a “blind” autoregressive walk into a “guided” best-first search. The model can look at three possible next steps, query the Q-model for their long-term value, and pursue the most promising one. Experimental validations on GSM8K and MATH datasets demonstrate that Q* significantly outperforms standard CoT and Majority Voting strategies by effectively navigating the reasoning graph and avoiding “dead ends” that simple probability would not detect.23
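
The guided-search idea can be sketched as a best-first loop in which a learned Q-model, rather than rollouts, supplies the node values. Here `propose_steps`, `q_value`, and `is_terminal` are placeholder callables, not the paper's actual interfaces.

```python
import heapq
from itertools import count
from typing import Callable, List, Tuple

def q_guided_search(propose_steps: Callable[[str], List[str]],
                    q_value: Callable[[str, str], float],
                    is_terminal: Callable[[str], bool],
                    question: str, max_expansions: int = 50) -> str:
    """Best-first search over reasoning steps, guided by a learned Q(s, a) model."""
    tie = count()  # tie-breaker so the heap never compares strings
    # Max-heap via negated Q-values.
    frontier: List[Tuple[float, int, str]] = [(0.0, next(tie), question)]
    while frontier and max_expansions > 0:
        _, _, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state
        max_expansions -= 1
        for action in propose_steps(state):      # candidate next thoughts
            q = q_value(state, action)           # estimated future reward of (state, action)
            heapq.heappush(frontier, (-q, next(tie), state + "\n" + action))
    return ""
```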

3.3 A* Decoding and Token Efficiency

Complementing Q* is A* Decoding, which explicitly targets the efficiency of the search.24 While methods like “Best-of-N” (generating N independent solutions) improve accuracy, they are wasteful, as they often generate N full failures to find one success. A* Decoding treats generation as a shortest-path problem on a graph where the “cost” of an edge is inversely related to its probability of correctness.

By using a PRM to provide the heuristic cost, A* Decoding prioritizes expanding the most promising partial sequences. If a reasoning path begins to yield low PRM scores, the search algorithm abandons it early (pruning), saving the compute that would have been wasted completing a doomed trajectory. This “fail fast” mechanism allows A* Decoding to achieve the same accuracy as Best-of-N with a fraction of the token budget, effectively shifting the Pareto frontier of inference efficiency.24
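
A sketch of this “fail fast” search, assuming a PRM score in [0, 1] as the heuristic; the cost function and pruning threshold below are illustrative choices, not the published algorithm's exact settings.

```python
import heapq
from itertools import count
from typing import Callable, List, Tuple

def a_star_decode(propose_steps: Callable[[str], List[str]],
                  prm_heuristic: Callable[[str], float],
                  is_complete: Callable[[str], bool],
                  question: str, prune_below: float = 0.2) -> str:
    """A*-style decoding: expand the lowest-cost partial solution first and
    abandon branches whose PRM heuristic drops below a threshold."""
    tie = count()
    # f = g + h: g = steps spent so far, h = (1 - PRM score) as estimated remaining cost.
    frontier: List[Tuple[float, int, int, str]] = [(0.0, next(tie), 0, question)]
    while frontier:
        _, _, g, state = heapq.heappop(frontier)
        if is_complete(state):
            return state
        for step in propose_steps(state):
            child = state + "\n" + step
            score = prm_heuristic(child)   # in [0, 1]; higher means more promising
            if score < prune_below:
                continue                   # prune doomed trajectories early
            heapq.heappush(frontier, (g + 1 + (1.0 - score), next(tie), g + 1, child))
    return ""
```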

3.4 The Verification Dilemma: Generative vs. Discriminative

A critical component of all search-based inference is the “Verifier” or “Reward Model”—the system that judges the quality of the generated text. A significant debate in the 2024-2025 literature centers on the architecture of these verifiers: Generative versus Discriminative.25

3.4.1 Discriminative Verifiers (ORMs/PRMs)

Discriminative verifiers are trained as classifiers. They take a text sequence (question + solution) and output a scalar score representing the probability of correctness ($P(\text{Correct} | \text{Input})$).

  • Pros: Fast and cheap (single forward pass).
  • Cons: They often suffer from “oversmoothing” and struggle to detect subtle logical errors. They act as “black boxes,” providing a score without an explanation, which makes them prone to “reward hacking” (being fooled by surface-level features like length or formatting).25

3.4.2 Generative Verifiers (GenRM)

Generative verifiers leverage the LLM’s own text generation capabilities to “think” about the verification. They produce a “Verification Chain-of-Thought” (e.g., “Let me check the first step… The derivation of X is correct… The second step has a sign error…”) followed by a final verdict.25

  • Pros: Significantly more accurate, especially for hard problems. The CoT forces the model to attend to specific details, reducing hallucination.
  • Cons: Computationally expensive. Verifying a solution might take as many tokens as generating it.

Inference Scaling Trade-offs:

Recent studies on Inference Scaling Laws for GenRM reveal a complex trade-off between generating more solutions (Exploration) and verifying them better (Exploitation).29

  • At low compute budgets, simple Self-Consistency (SC) (generating multiple solutions and voting) outperforms complex verification. The cost of the verifier is better spent on just trying again.
  • At high compute budgets, Generative Verification becomes dominant. As the number of candidate solutions grows, the “distractor” solutions (plausible but wrong) overwhelm simple voting. A strong GenRM is needed to filter these out.

The Generator-Verifier Gap research highlights that weak generators produce errors that are easy to detect, but strong generators produce “plausible hallucinations” that require equally strong generative verifiers to catch.28 This suggests that as models get smarter, the cost of verifying them will grow linearly or super-linearly, cementing the shift to a compute-intensive inference paradigm.
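
The hybrid “weighted self-consistency” strategy listed in the table below sits between these two extremes. A minimal sketch, with `extract_answer` and `verifier_score` as placeholder callables:

```python
from collections import defaultdict
from typing import Callable, Dict, List

def weighted_self_consistency(solutions: List[str],
                              extract_answer: Callable[[str], str],
                              verifier_score: Callable[[str], float]) -> str:
    """Hybrid voting: each sampled solution contributes its verifier score
    to the tally of its final answer, combining voting with verification."""
    tally: Dict[str, float] = defaultdict(float)
    for sol in solutions:
        tally[extract_answer(sol)] += verifier_score(sol)
    return max(tally, key=tally.get)
```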

Comparison of Verification Strategies

 

| Strategy | Mechanism | Compute Cost | Best For | Scaling Law Behavior |
|---|---|---|---|---|
| Self-Consistency (SC) | Majority vote over $N$ samples | Low (per sample) | Low/Med Budgets | Logarithmic gain |
| Discriminative Verifier | Scalar score ranking | Low (1 pass) | High-throughput | Plateaus early (Oversmoothing) |
| Generative Verifier (GenRM) | CoT Critique + Verdict | High (N tokens) | High Budgets, Hard Tasks | Linear/Super-linear gain 29 |
| Hybrid (WSC) | Weighted Vote (Score × Count) | Low/Med | General Purpose | Balanced Trade-off 31 |

4. Reinforcement Learning and the Incentivization of Reasoning

While prompting and inference scaling optimize the deployment of models, the most fundamental improvements in 2025 have come from novel training paradigms. The release of DeepSeek-R1 has provided a definitive proof-of-concept that reasoning capabilities can be incentivized to emerge from scratch through pure Reinforcement Learning (RL).32

4.1 The Emergence of “DeepSeek-R1-Zero”

Traditionally, reasoning models were trained via Supervised Fine-Tuning (SFT) on thousands of human-annotated reasoning traces (e.g., GSM8K, MATH). The DeepSeek-R1 research challenged this dogma. The authors trained DeepSeek-R1-Zero using large-scale RL (specifically the GRPO algorithm) on a base model, without any initial SFT supervision on reasoning data.35

The training setup was deceptively simple: the model was given a problem and a binary reward signal (Correct/Incorrect). It was not told how to reason. Remarkably, sophisticated reasoning behaviors—including self-verification (“Wait, let me check that”), backtracking, and long Chain-of-Thought generation—emerged naturally from the optimization process.33 The model “discovered” that to maximize the reward (solving the hard math problem), it had to spend more tokens processing the information. This serves as a powerful validation of the Instrumental Convergence hypothesis: reasoning is an instrumental goal for solving complex tasks.
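
The shape of this setup is easy to sketch: a rule-based outcome reward plus GRPO's group-relative advantage, in which rewards are normalised within the group of samples drawn for the same prompt and no learned value critic is needed. The “####” answer delimiter and the exact normalisation constant below are illustrative assumptions, not DeepSeek's released code.

```python
import re
from statistics import mean, pstdev
from typing import List

def outcome_reward(completion: str, gold_answer: str) -> float:
    """Rule-based binary reward: 1 if the extracted final answer matches the
    reference, else 0 (real pipelines add a small format reward as well)."""
    match = re.search(r"####\s*(.+)", completion)  # hypothetical answer delimiter
    return 1.0 if match and match.group(1).strip() == gold_answer.strip() else 0.0

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: normalise each sample's reward against the
    group sampled for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Example: four sampled completions for one prompt, only one correct.
# The correct sample receives a positive advantage, the others negative ones.
print(group_relative_advantages([0.0, 1.0, 0.0, 0.0]))
```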

However, R1-Zero exhibited significant “usability” issues. Its reasoning traces were often chaotic, mixed multiple languages, or contained infinite loops. While it got the right answer, the process was illegible to humans.32 This highlighted a distinction between internal reasoning (effective for the model) and external reasoning (useful for humans).

4.2 The “Cold Start” and Distillation Pipeline

To address the readability issues, the full DeepSeek-R1 pipeline reintroduced a “Cold Start” phase. A small dataset of high-quality, readable CoT examples was used to fine-tune the base model before the RL phase.35 This “primed” the model to reason in a structured, human-legible format, which the subsequent RL then optimized.

Crucially, the research demonstrated the power of Reasoning Distillation. The reasoning patterns generated by the massive R1 model (671B parameters) could be distilled into smaller, dense models (e.g., 7B, 32B) by fine-tuning the small models on the outputs of the large model.36

  • Result: The distilled 32B model outperformed far larger non-reasoning models such as GPT-4o on benchmarks like AIME 2024 (Pass@1: 72.6%).36
  • Implication: Reasoning is not solely a function of parameter scale. It is a learnable representation or “style” of processing that can be transferred from a teacher to a student. This democratizes high-end reasoning, allowing efficient small models to punch above their weight class.

4.3 Process Reward Models (PRMs) and Dense Supervision

While R1 used outcome rewards, the broader field is moving toward Process Reward Models (PRMs) for dense supervision.37 PRMs address the “sparse reward” problem. In a 100-step proof, an outcome reward gives no feedback on where the error occurred. A PRM assigns a score to each step.
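
Concretely, a PRM is queried once per prefix of the chain, and the per-step scores are then aggregated (taking the minimum is one common choice). The sketch below treats the PRM as a placeholder callable.

```python
from typing import Callable, List

def score_steps(prm: Callable[[str, List[str]], float],
                question: str, steps: List[str]) -> List[float]:
    """Dense supervision: the PRM scores every prefix of the reasoning chain,
    localising where an error is introduced."""
    return [prm(question, steps[: i + 1]) for i in range(len(steps))]

def chain_value(step_scores: List[float]) -> float:
    """One common aggregation: the chain is only as strong as its weakest step."""
    return min(step_scores) if step_scores else 0.0
```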

Challenges in PRM Training:

  1. Annotation Cost: Human labeling of every step is prohibitively expensive.
  2. Self-Correction: Standard PRM data assumes that if a step is wrong, the whole chain is dead. However, reasoning models often make a mistake and then fix it. A standard PRM would penalize this successful self-correction.
  • Solution: New annotation protocols like “Error Propagation vs. Error Cessation” 40 are being developed to teach PRMs to recognize and reward the act of correction, not just the absence of errors.
  • ThinkPRM: Recent work on “ThinkPRM” uses the inherent reasoning abilities of Long-CoT models to generate synthetic data for PRM training, allowing for data-efficient fine-tuning on orders of magnitude fewer labels.41

5. Architectural Modifications: Beyond the Transformer

While prompting and inference scaling operate on the software level, researchers are now modifying the hardware of the neural architecture itself to better support reasoning.

5.1 Mutual Information and “Thinking Tokens”

Information-theoretic analysis of Large Reasoning Models (LRMs) has revealed a phenomenon known as Mutual Information (MI) Peaks.43 During a reasoning trace, the mutual information between the model’s hidden states and the correct answer does not accrue linearly. Instead, it spikes at specific tokens—often linguistic markers like “Wait”, “Therefore”, or “Hmm”.

These tokens, termed Thinking Tokens, act as “control nodes.” They represent moments where the model consolidates information, reduces entropy, and pivots its internal state. Experiments show that suppressing these tokens catastrophically degrades performance, while suppressing random tokens has minimal impact.44 This has led to the proposal of Reflection Tokens—specialized vocabulary items that explicitly trigger a “pause and check” operation, effectively baking System 2 behavior into the model’s vocabulary.43
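
In standard information-theoretic notation, the tracked quantity is the mutual information between the hidden state $h_t$ at reasoning step $t$ and the correct answer $Y$; the formulation below is the textbook definition, not necessarily the paper's exact estimator.

```latex
I(h_t; Y) \;=\; \mathbb{E}_{p(h_t,\,Y)}\!\left[\log \frac{p(h_t, Y)}{p(h_t)\,p(Y)}\right]
```

MI peaks are the steps $t$ at which $I(h_t; Y)$ rises sharply relative to neighbouring steps, and these steps empirically coincide with the marker tokens described above.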

5.2 Dynamic Associative Memory: The CoAT Framework

A fundamental limitation of the Transformer is its lack of a persistent “working memory”; it must reconstruct context from the history at every step (quadratic complexity). This leads to the “Working Memory Cliff,” where performance drops sharply as the number of variables to track increases (e.g., sorting >30 items).46

The Chain-of-Associated-Thoughts (CoAT) framework addresses this by integrating a Dynamic Associative Memory module (sketched after the list below).47

  • Mechanism: CoAT uses a dual-stream architecture. As the model generates reasoning steps (stream 1), it embeds key information and stores it in an external vector database (stream 2).
  • Synergy: This memory acts as a “blackboard.” When the model generates a new step, it queries the memory for relevant prior associations. This allows it to recall a constraint defined 4000 tokens ago without needing to attend to it directly in the context window. It also supports the MCTS planner by allowing different branches of the search tree to share information.47
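
A minimal sketch of such an associative store, with `embed` as a placeholder embedding function and brute-force cosine search standing in for a real vector database:

```python
from math import sqrt
from typing import Callable, List, Tuple

class AssociativeMemory:
    """Minimal external memory: store (text, embedding) pairs during generation,
    then retrieve the most relevant prior notes for the current reasoning step."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self.embed = embed
        self.entries: List[Tuple[str, List[float]]] = []

    def write(self, note: str) -> None:
        self.entries.append((note, self.embed(note)))

    def read(self, query: str, k: int = 3) -> List[str]:
        q = self.embed(query)

        def cos(a: List[float], b: List[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
            return dot / (na * nb + 1e-9)

        ranked = sorted(self.entries, key=lambda e: cos(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```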

5.3 Hierarchical and Recurrent Architectures

The Hierarchical Reasoning Model (HRM) proposes a bio-inspired architecture using nested recurrent modules (sketched after the list below).49

  • Structure: HRM consists of a “Fast/Low-level” module (System 1) that processes tokens rapidly, and a “Slow/High-level” module (System 2) that updates less frequently.
  • Function: The High-level module provides “context vectors” or strategic guidance to the Low-level module. This separation allows the model to maintain a stable long-term plan (High-level) while executing the tactical token generation (Low-level), addressing the issue where local syntax corrections distract from global logic.
  • Efficiency: HRM uses “one-step gradient” updates and adaptive computation time, allowing it to “think” (loop) as long as necessary for hard problems, decoupling compute from input length.49
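
A schematic of the two-timescale recurrence; the update rules `high_step` and `low_step`, the period, and the reset policy are illustrative stand-ins, not the published HRM equations.

```python
from typing import Callable, List

def hrm_rollout(high_step: Callable[[List[float], List[float]], List[float]],
                low_step: Callable[[List[float], List[float], str], List[float]],
                tokens: List[str], d: int = 4, period: int = 8) -> List[float]:
    """Two-timescale recurrence: a slow high-level state updates every `period`
    tokens and conditions a fast low-level state that updates every token."""
    high = [0.0] * d   # strategic / plan state (System 2 analogue)
    low = [0.0] * d    # tactical / token-level state (System 1 analogue)
    for i, tok in enumerate(tokens):
        low = low_step(low, high, tok)       # fast module conditioned on the slow context
        if (i + 1) % period == 0:
            high = high_step(high, low)      # slow module summarises the recent segment
            low = [0.0] * d                  # reset the fast state (one possible policy)
    return high
```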

5.4 Neuro-Symbolic Integration (NeSy)

Finally, Neuro-Symbolic (NeSy) architectures seek to bridge the gap between probabilistic LLMs and deterministic logic.50

  • Hybrid Models: The LLM acts as a “neural parser,” translating natural language into a symbolic representation (e.g., Python code, logic predicates). A symbolic solver (e.g., a Python interpreter or Theorem Prover) executes the logic, and the result is fed back to the LLM. This “Code-as-Reasoning” paradigm is becoming standard for math tasks, bypassing the LLM’s arithmetic weaknesses (see the sketch after this list).52
  • Integrative Models: Newer approaches attempt to encode logical rules within the neural weights (e.g., Logic Neural Networks), allowing the model to be differentiable while satisfying logical constraints. These are less mature but promise true end-to-end logical reasoning.52
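
A bare-bones sketch of the Code-as-Reasoning loop, with `llm` as a placeholder callable; a production system would sandbox execution and feed errors back for repair rather than run the generated program directly.

```python
import subprocess
import sys
import tempfile
from typing import Callable

def solve_with_code(llm: Callable[[str], str], question: str) -> str:
    """Code-as-Reasoning: the LLM translates the question into a Python program,
    an interpreter executes it, and the printed result is returned."""
    program = llm(
        "Write a self-contained Python program that prints only the final answer to:\n"
        f"{question}\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    # NOTE: a real system would sandbox this call; this is a bare sketch.
    result = subprocess.run([sys.executable, path], capture_output=True,
                            text=True, timeout=30)
    return result.stdout.strip()
```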

6. Comprehensive Comparative Analysis

To understand the efficacy of these diverse approaches, it is necessary to compare them across key dimensions: Reasoning Depth, Computational Cost, and Implementation Complexity.

6.1 Performance Benchmarking (MATH / GSM8K)

 

| Methodology | Benchmark Performance (Approx.) | Inference Cost Multiplier | Key Characteristic |
|---|---|---|---|
| Standard CoT | Baseline (e.g., ~50% MATH) | 1x | Linear, brittle |
| Self-Consistency (SC) | +5-10% over Baseline | $N$x (e.g., 40x) | Simple, effective for easy tasks |
| Tree of Thoughts (ToT) | +10-20% over Baseline | 100x+ | High accuracy, very slow |
| DeepSeek-R1 (RL) | SOTA (97.3% MATH-500) 36 | Variable (Long CoT) | Emergent reasoning, “Aha moments” |
| Q* / Q-Value Search | Superior to SC at fixed budget 23 | Moderate | Guided search, efficient |
| Particle Filtering | 4-16x better scaling than beam search 17 | High | Best for “needle in haystack” reasoning |

Analysis:

  • DeepSeek-R1 demonstrates that training the model to reason (via RL) is currently the most effective single intervention, achieving 97.3% on MATH-500.36
  • Inference Scaling (Q*, PF) acts as a multiplier. A strong reasoning model (like R1) combined with Particle Filtering would likely define the new state-of-the-art, though at significant cost.
  • ToT is largely being superseded by AoT (for efficiency) and Q* (for better search guidance), as the overhead of external controllers is too high for production systems.

6.2 The “Accuracy-Efficiency” Trade-off

The recurrent theme across all research is the trade-off between Accuracy and Efficiency.

  • System 1 Models (Standard LLMs) are efficient (linear time) but inaccurate on hard tasks.
  • System 2 Models (R1, ToT, CoAT) are accurate but expensive.
  • Hybrid Approaches (System 2 Alignment, A* Decoding) attempt to navigate the Pareto frontier. For example, A* Decoding abandons unpromising paths early (“fail fast”), saving compute for the promising ones.24

7. Future Directions: The Compute Economy

The evolution of LLMs is transitioning from a phase of “Knowledge Acquisition” (Pre-training) to “Reasoning Optimization” (Inference/RL).

7.1 The Rise of Generative Verification

As generators become stronger, Discriminative Verifiers (classifiers) are failing. They cannot distinguish between a “subtle error” and a “correct complex derivation.” Generative Verification (GenRM) will likely become standard for high-stakes domains (medicine, engineering), despite the cost. We will see the emergence of “Verifier-Specific Models”—LLMs trained solely to critique the reasoning of others.28

7.2 The “Black Box” of Emergent Reasoning

The success of R1-Zero poses a safety challenge. If models develop their own “internal languages” or reasoning shortcuts to maximize rewards, how do we ensure alignment? Research into Mechanistic Interpretability of “Thinking Tokens” and Cold Start priming will be critical to keeping these “Alien” reasoning processes legible to humans.35

7.3 Hybrid Neuro-Symbolic Architectures

The “Working Memory Cliff” 46 suggests that Transformers alone cannot solve infinite-horizon problems. We expect the integration of Associative Memory Modules (like CoAT) and Symbolic Solvers to become deeper, moving from “Tool Use” (API calls) to “Native Integration” (differentiable logic layers) within the next generation of architectures.

8. Conclusion

Improving LLMs’ ability to break down complex problems is no longer about finding the “perfect prompt.” It has evolved into a multi-disciplinary engineering challenge. It requires training models to value reasoning via RL (DeepSeek-R1), equipping them with the right cognitive topologies (GoT, CoAT), guiding their inference with probabilistic search (Particle Filtering, Q*), and supporting them with memory-augmented architectures (HRM).

The “System 2” transition is well underway. The AI of 2025 does not just predict the next word; it navigates a decision tree, consults a memory bank, simulates a future state, and verifies its own logic before committing to a final answer. This shift from “Generation” to “Deliberation” marks the maturation of Large Language Models into true Reasoning Engines.

Works cited

  1. arXiv:2502.12134v1 [cs.CL] 17 Feb 2025, accessed on December 22, 2025, https://arxiv.org/pdf/2502.12134
  2. On reasoning versus inference-time scaling | Red Hat Developer, accessed on December 22, 2025, https://developers.redhat.com/articles/2025/02/17/reasoning-versus-inference-time-scaling
  3. Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2502.12470v1
  4. Inference-Time Techniques for LLM Reasoning – Berkeley RDI, accessed on December 22, 2025, https://rdi.berkeley.edu/adv-llm-agents/slides/inference_time_techniques_lecture_sp25.pdf
  5. Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead – Microsoft, accessed on December 22, 2025, https://www.microsoft.com/en-us/research/wp-content/uploads/2025/03/Inference-Time-Scaling-for-Complex-Tasks-Where-We-Stand-and-What-Lies-Ahead-2.pdf
  6. Graph of Thoughts: Solving Elaborate Problems with Large Language Models, accessed on December 22, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/29720/31236
  7. Tree of Thoughts (ToT) – Prompt Engineering Guide, accessed on December 22, 2025, https://www.promptingguide.ai/techniques/tot
  8. Something-of-Thought in LLM Prompting: An Overview of Structured LLM Reasoning, accessed on December 22, 2025, https://towardsdatascience.com/something-of-thought-in-llm-prompting-an-overview-of-structured-llm-reasoning-70302752b390/
  9. Large Language Model Reasoning Process and Prompting techniques Part 1 – Xin Cheng, accessed on December 22, 2025, https://billtcheng2013.medium.com/large-language-model-reasoning-process-and-prompting-techniques-part-1-e3c31a78f1a0
  10. How Algorithm of Thoughts Prompting Works – PromptHub, accessed on December 22, 2025, https://www.prompthub.us/blog/how-algorithm-of-thoughts-prompting-works
  11. What is an Algorithm of Thoughts (AoT) and How Does it Work? – Analytics Vidhya, accessed on December 22, 2025, https://www.analyticsvidhya.com/blog/2024/07/algorithm-of-thoughts/
  12. Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2412.09078v1
  13. open-thought/system-2-research: System 2 Reasoning Link Collection – GitHub, accessed on December 22, 2025, https://github.com/open-thought/system-2-research
  14. How Scaling Laws Drive Smarter, More Powerful AI – NVIDIA Blog, accessed on December 22, 2025, https://blogs.nvidia.com/blog/ai-scaling-laws/
  15. AI Inference Time Scaling Laws Explained – Supermicro, accessed on December 22, 2025, https://learn-more.supermicro.com/data-center-stories/ai-inference-time-scaling-laws-explained
  16. A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2502.01618v3
  17. Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2502.01618v5
  18. A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods – ChatPaper, accessed on December 22, 2025, https://chatpaper.com/paper/104240
  19. Improving Multi-Step Reasoning in Large Language Models – Hackernoon, accessed on December 22, 2025, https://hackernoon.com/improving-multi-step-reasoning-in-large-language-models
  20. Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning – OpenReview, accessed on December 22, 2025, https://openreview.net/forum?id=F7QNwDYG6I
  21. This overview provides a foundational understanding of Q* and its application in multi-step AI reasoning. – GitHub Gist, accessed on December 22, 2025, https://gist.github.com/Cdaprod/b110d346d8b45d72b0872e15144ee6ae
  22. Q*: Enhanced Multi-Step Reasoning for LLMs – Emergent Mind, accessed on December 22, 2025, https://www.emergentmind.com/papers/2406.14283
  23. Q*: A Versatile Artificial Intelligence AI Approach to Improve LLM Performance in Reasoning Tasks – MarkTechPost, accessed on December 22, 2025, https://www.marktechpost.com/2024/06/27/q-a-versatile-artificial-intelligence-ai-approach-to-improve-llm-performance-in-reasoning-tasks/
  24. A*-Decoding: Token-Efficient Inference Scaling – Emergent Mind, accessed on December 22, 2025, https://www.emergentmind.com/papers/2505.13672
  25. Generative Verifiers: Reward Modeling as Next-Token Prediction – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2408.15240v3
  26. Generative Verifiers: Reward Modeling as Next-Token Prediction – ResearchGate, accessed on December 22, 2025, https://www.researchgate.net/publication/383460947_Generative_Verifiers_Reward_Modeling_as_Next-Token_Prediction
  27. [Quick Review] Generative Verifiers: Reward Modeling as Next-Token Prediction – Liner, accessed on December 22, 2025, https://liner.com/review/generative-verifiers-reward-modeling-as-nexttoken-prediction
  28. Variation in Verification: Understanding Verification Dynamics in Large Language Models, accessed on December 22, 2025, https://arxiv.org/html/2509.17995v1
  29. When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning – OpenReview, accessed on December 22, 2025, https://openreview.net/pdf?id=qvKfyns8ry
  30. When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning – arXiv, accessed on December 22, 2025, https://arxiv.org/html/2504.01005v2
  31. Budget-aware Test-time Scaling via Discriminative Verification – ChatPaper, accessed on December 22, 2025, https://chatpaper.com/paper/200289
  32. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, accessed on December 22, 2025, https://www.researchgate.net/publication/388317525_DeepSeek-R1_Incentivizing_Reasoning_Capability_in_LLMs_via_Reinforcement_Learning
  33. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning – arXiv, accessed on December 22, 2025, https://arxiv.org/pdf/2501.12948
  34. Deepseek-R1 Incentivizes Reasoning in Llms Through Reinforcement Learning – Scribd, accessed on December 22, 2025, https://www.scribd.com/document/919531060/s41586-025-09422-z
  35. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, accessed on December 22, 2025, https://arxiv.org/html/2501.12948v1
  36. deepseek-ai/DeepSeek-R1 – Hugging Face, accessed on December 22, 2025, https://huggingface.co/deepseek-ai/DeepSeek-R1
  37. [2510.08049] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models – arXiv, accessed on December 22, 2025, https://arxiv.org/abs/2510.08049
  38. Process-supervised Reward Models (PRMs) – Emergent Mind, accessed on December 22, 2025, https://www.emergentmind.com/topics/process-supervised-reward-models-prm
  39. R-PRM: Reasoning-Driven Process Reward Modeling – ACL Anthology, accessed on December 22, 2025, https://aclanthology.org/2025.emnlp-main.679.pdf
  40. Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning – ACL Anthology, accessed on December 22, 2025, https://aclanthology.org/2025.findings-emnlp.253.pdf
  41. Process Reward Models That Think – arXiv, accessed on December 22, 2025, https://arxiv.org/pdf/2504.16828
  42. Process Reward Models That Think | OpenReview, accessed on December 22, 2025, https://openreview.net/forum?id=V727xqBYIW
  43. Reflection Tokens in LLM Reasoning – Emergent Mind, accessed on December 22, 2025, https://www.emergentmind.com/topics/reflection-tokens
  44. Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning | OpenReview, accessed on December 22, 2025, https://openreview.net/forum?id=E1FrjgaG1J
  45. Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning | alphaXiv, accessed on December 22, 2025, https://www.alphaxiv.org/overview/2506.02867v1
  46. How Much Can You Ask an LLM to Track? Finding the Working Memory Cliff – Ian Bull, accessed on December 22, 2025, https://ianbull.com/posts/working-memory-cliff/
  47. CoAT: Chain-of-Associated-Thoughts Framework for Enhancing Large Language Models Reasoning – ChatPaper, accessed on December 22, 2025, https://chatpaper.com/paper/104926
  48. Chain-of-Associated-Thoughts (CoAT): An AI Framework to Enhance LLM Reasoning, accessed on December 22, 2025, https://www.marktechpost.com/2025/02/06/chain-of-associated-thoughts-coat-an-ai-framework-to-enhance-llm-reasoning/
  49. Hierarchical Reasoning Models: Thinking in Layers | Apolo AI …, accessed on December 22, 2025, https://www.apolo.us/blog-posts/hierarchical-reasoning-models-thinking-in-layers
  50. Neuro-Symbolic AI: A Foundational Analysis of the Third Wave’s Hybrid Core, accessed on December 22, 2025, https://gregrobison.medium.com/neuro-symbolic-ai-a-foundational-analysis-of-the-third-waves-hybrid-core-cc95bc69d6fa
  51. Neuro-Symbolic AI: The Comeback of Logic in an LLM World – Insights2TechInfo, accessed on December 22, 2025, https://insights2techinfo.com/neuro-symbolic-ai-the-comeback-of-logic-in-an-llm-world/
  52. A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning, accessed on December 22, 2025, https://openreview.net/forum?id=uO0oaNY9fC&referrer=%5Bthe%20profile%20of%20Michael%20K.%20Chen%5D(%2Fprofile%3Fid%3D~Michael_K._Chen1)
  53. (PDF) Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection – ResearchGate, accessed on December 22, 2025, https://www.researchgate.net/publication/397701162_Scaling_Generative_Verifiers_For_Natural_Language_Mathematical_Proof_Verification_And_Selection