The Emergence of General-Purpose Robotic Intelligence: An Analysis of Foundation Models and the RT-1 to RT-2 Trajectory

Section 1: The Foundation Model Paradigm: A New Architecture for Intelligence

The field of artificial intelligence (AI) is undergoing a fundamental transformation, moving away from an era of narrow specialization towards one of general-purpose capability. This paradigm shift is driven by the advent of “foundation models”—large-scale, deep learning neural networks trained on massive, diverse datasets.1 These models represent a departure from the traditional approach of developing bespoke AI systems from scratch for each specific task. Instead, they serve as a powerful, reusable infrastructure that can be adapted to a wide range of applications with significantly reduced time and cost.3 This evolution is not merely an incremental improvement; it redefines the architecture of intelligence itself, with profound implications for every domain AI touches, most notably the complex and physically grounded world of robotics.

 

1.1. From Task-Specific ML to General-Purpose AI

 

For decades, the dominant methodology in machine learning (ML) involved the creation of highly specialized models. An organization seeking to forecast sales trends, analyze text for sentiment, and classify images would need to develop three distinct, siloed models, each trained on a custom, meticulously labeled dataset.1 This approach, while effective for narrow problems, is resource-intensive and does not scale efficiently: each new problem requires a new development cycle, limiting the speed of innovation and the breadth of AI adoption.

Foundation models invert this logic. The term, coined by researchers at Stanford University’s Center for Research on Foundation Models (CRFM), describes a model that is “critically central yet incomplete”.2 It is central because it encapsulates a vast repository of general world knowledge learned during a massive pre-training phase. It is incomplete because it is not designed for a single end-task but is intended to be adapted—or fine-tuned—to serve as the basis for a multitude of downstream applications.2 Prominent examples like the large language models (LLMs) GPT-4 and Claude 2, or the text-to-image model Stable Diffusion, demonstrate this versatility. A single foundation model can be prompted to write blog posts, solve math problems, generate computer code, and engage in dialogue, showcasing a breadth of capability that was previously unattainable.1 This shift from bespoke, single-use models to a common, adaptable foundation accelerates development, reduces costs, and democratizes access to powerful AI capabilities.3

The strategic implications of this shift are significant. The competitive advantage in AI is no longer solely defined by possessing a large, labeled dataset for a specific task. Instead, the primary strategic asset becomes the access to immense quantities of broad, unstructured, multimodal data and the vast computational infrastructure required to pre-train a base model. This dynamic inherently favors large, data-rich corporations with the resources to undertake such a monumental initial investment.6 However, this concentration of power at the foundational layer simultaneously creates a vibrant new market for a broader ecosystem of innovators. These smaller, more agile players can focus on the “last mile” problem: collecting high-quality, specialized datasets for fine-tuning these powerful base models for specific, high-value vertical applications. This suggests a future industry structure composed of a few large-scale “foundation model providers” and a much larger, diverse ecosystem of “application developers” who build specialized solutions on top of this shared infrastructure.

 

1.2. Core Principles: Pre-training, Transfer Learning, and Emergent Capabilities

 

The power of foundation models stems from a confluence of established machine learning techniques applied at an unprecedented scale. The core principles underpinning their success are pre-training on broad data, adaptability through transfer learning, and the phenomenon of emergent capabilities.

Pre-training on Broad Data: The initial and most resource-intensive phase in creating a foundation model is pre-training. This involves training a model on vast, often petabyte-scale, datasets of unlabeled and unstructured data, such as the text and images of the public internet.1 The learning process is typically “self-supervised,” meaning the model generates its own training signals from the data itself—for example, by learning to predict the next word in a sentence or fill in a masked portion of an image.1 This allows the model to learn intricate patterns, semantic relationships, contextual nuances, and a form of “common sense” world knowledge without the need for costly and time-consuming human labeling.2 The Transformer architecture has become the de facto standard for building these models, as its self-attention mechanism is highly effective at processing long-range dependencies in sequential data and scales efficiently with massive datasets and model sizes.2
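To make the self-supervised objective concrete, the minimal sketch below shows how next-token prediction derives its training targets directly from the data, with no human labels. The character-level vocabulary and the random stand-in for a model's outputs are purely illustrative; real foundation models use learned subword tokenizers and Transformer decoders trained over web-scale corpora.

```python
import torch
import torch.nn.functional as F

# Illustrative only: a toy "tokenizer" mapping characters to integer ids.
text = "the robot picked up the apple"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
token_ids = torch.tensor([vocab[ch] for ch in text])

# Self-supervised next-token prediction: the data supplies its own labels.
inputs = token_ids[:-1]   # the model sees tokens 0..N-1
targets = token_ids[1:]   # and must predict tokens 1..N (shifted by one)

# An autoregressive model (e.g., a Transformer decoder) would produce logits
# over the vocabulary for each position; training minimizes cross-entropy.
logits = torch.randn(len(inputs), len(vocab))  # stand-in for model(inputs)
loss = F.cross_entropy(logits, targets)
print(loss.item())
```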

Adaptability via Transfer Learning: The general knowledge encoded within the model’s parameters during pre-training is not an end in itself. Its true utility is realized through transfer learning, where the pre-trained model is adapted for a specific downstream task.2 The most common adaptation method is fine-tuning, which involves continuing the training process on a much smaller, task-specific dataset.3 Because the model already possesses a rich understanding of language, vision, or other modalities, it can learn the new task with far less data and computational effort than a model trained from scratch.3 This efficiency is a key driver of the foundation model paradigm.

Emergent Capabilities: Perhaps the most remarkable and scientifically intriguing aspect of foundation models is the emergence of capabilities that were not explicitly programmed or trained for.4 As models are scaled up in terms of parameter count and training data volume, they begin to exhibit surprising new abilities. These include “in-context learning,” where a model can perform a task it has never seen before simply by being shown a few examples in its prompt, and more complex reasoning skills, often induced through techniques like “Chain of Thought” (CoT) prompting, where the model is encouraged to break down a problem into intermediate steps.9 This ability to generalize and reason in a zero-shot or few-shot manner is a hallmark of these large-scale models and a critical step towards more general forms of intelligence.8

 

1.3. The Challenge of Embodiment: Applying Foundation Models to the Physical World

 

While foundation models have revolutionized the digital domains of language and vision, applying this paradigm to the physical world of robotics presents a unique and formidable set of challenges. A robot foundation model is conceptualized as a large-scale, versatile model that can serve as a building block for a wide array of physical tasks, from manipulation to navigation.7 The goal is to move beyond rigid, pre-programmed robots or models trained for a single task, towards general-purpose machines that can adapt to new instructions, objects, and environments with minimal fine-tuning.7 However, the bridge from digital bits to physical atoms is fraught with complexities that are far less pronounced in other AI domains.

The primary and most persistent obstacle is data scarcity. Unlike the trillions of words and billions of images readily available on the internet, high-quality robotics data—comprising synchronized sensor inputs (vision, force, touch) and motor commands—is exceptionally difficult, expensive, and time-consuming to collect.7 Each data point requires a physical robot, a controlled environment, and often, a human operator for teleoperation or demonstration. This data bottleneck has historically been the single greatest impediment to scaling up robot learning.

A second major challenge is the simulation-to-reality (sim-to-real) gap. While training in high-fidelity simulators offers a scalable way to generate data, models trained exclusively in simulation often fail when deployed on physical robots.7 Simulators struggle to perfectly capture the complex physics of contact and friction, the nuances of sensor noise, and the infinite visual variety of real-world lighting and textures. This discrepancy can lead to policies that are brittle and fail in the face of the unstructured nature of reality.7

Furthermore, robotics demands physical grounding and safety. An LLM generating incorrect text has minimal real-world consequence. A robot executing an incorrect action can cause damage to itself, its environment, or humans.13 A robot foundation model must therefore be able to ground abstract natural language commands (e.g., “gently pick up the egg”) into a precise sequence of low-level motor commands that are physically plausible and safe. This requires a deep, implicit understanding of physics and causality that is not required for purely digital tasks.13

Finally, robotic control operates under strict real-time constraints. A robot manipulating an object or navigating a dynamic environment must perceive, decide, and act at a high frequency, often many times per second. The massive size of foundation models, particularly Transformers, makes low-latency inference a significant computational challenge, especially on the power- and size-constrained computers typically found on mobile robots.8 These four challenges—data scarcity, the sim-to-real gap, physical grounding, and real-time performance—define the unique landscape that any successful robot foundation model must navigate.

 

Section 2: RT-1 – Establishing a Scalable Baseline for Real-World Control

 

In December 2022, researchers at Google introduced Robotics Transformer 1 (RT-1), a model that represented a significant milestone in the application of the foundation model paradigm to robotics. RT-1 was not the final destination but a critical proof-of-concept. Its primary contribution was to provide compelling empirical evidence that a single, large-scale, data-driven Transformer model could be trained to perform a wide variety of real-world physical tasks, demonstrating robust generalization to new instructions and environments. It effectively validated the core hypothesis that the principles of task-agnostic, large-scale training could be successfully translated from the digital to the physical domain.14

 

2.1. Architectural Blueprint: The Robotics Transformer

 

RT-1 is a multi-task model built upon a Transformer architecture, specifically a decoder-only sequence model, designed for real-world robotic control at scale.14 Its architecture was engineered to address two of the core challenges in robotics: the need for a high-capacity model that could absorb diverse data, and the necessity of efficient inference for real-time operation.14

The model’s input consists of a short history of images from the robot’s camera and a natural language instruction describing the desired task.16 A key architectural innovation lies in how these inputs and the corresponding outputs are processed. The entire problem of robotic control is framed as a sequence modeling task, where the model learns to predict a sequence of action tokens based on a sequence of input tokens.14

  • Input Processing and Tokenization: Images are first processed by an EfficientNet model (pre-trained on the ImageNet dataset) to extract visual features. Crucially, this visual processing is conditioned on the natural language instruction using FiLM (Feature-wise Linear Modulation) layers. This technique allows the model to modulate the visual features based on the task description at an early stage, enabling it to focus on task-relevant information—for example, paying more attention to a specific object mentioned in the command.14 The resulting image features are then tokenized. To manage the computational load of processing high-resolution image data, RT-1 incorporates the TokenLearner module, an attention-based mechanism that learns to compress the sequence of image tokens by adaptively selecting the most important information. This compression results in a greater than 2.4x speed-up in inference, a critical optimization that makes real-time control at 3 Hz feasible.14
  • Output Generation and Action Tokenization: The model’s output is a sequence of tokenized actions. The robot’s continuous action space—which includes 7 degrees of freedom for arm movement (x, y, z, roll, pitch, yaw, gripper), 3 for base movement (x, y, yaw), and a mode-switching variable—is discretized into 256 distinct bins for each dimension.16 This converts the complex problem of predicting continuous motor commands into a more manageable classification problem of predicting the correct action “token” for each dimension. By tokenizing both inputs (images, text) and outputs (motor commands), RT-1 elegantly fits the entire perception-to-action pipeline into the powerful and scalable framework of a sequence-to-sequence Transformer model.14 A simplified sketch of this discretization scheme appears below.
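The sketch below illustrates per-dimension discretization into 256 bins and the approximate recovery of continuous commands at execution time. The dimension layout and value ranges are illustrative assumptions, not the exact bounds defined in the RT-1 paper.

```python
import numpy as np

NUM_BINS = 256

def tokenize_action(action, low, high):
    """Map each continuous action dimension to an integer bin in [0, 255]."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)            # scale to [0, 1]
    return np.minimum((normalized * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize_action(bins, low, high):
    """Recover an approximate continuous action from bin centers."""
    return low + (bins + 0.5) / NUM_BINS * (high - low)

# Illustrative 11-D action: 7 arm dims, 3 base dims, 1 mode-switching flag.
low = np.array([-1.0] * 10 + [0.0])
high = np.array([1.0] * 10 + [2.0])
action = np.array([0.1, -0.3, 0.5, 0.0, 0.2, -0.1, 1.0, 0.0, 0.0, 0.0, 1.0])

tokens = tokenize_action(action, low, high)
recovered = detokenize_action(tokens, low, high)
print(tokens)      # one integer in [0, 255] per action dimension
print(recovered)   # continuous values within half a bin of the originals
```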

 

2.2. Data as the Bedrock: Analysis of the 130,000-Episode Real-World Dataset

 

The performance of any foundation model is inextricably linked to the scale and diversity of its training data. RT-1 was trained on one of the largest and most diverse real-world robotics datasets of its time, a testament to the “data-first” approach required for this new paradigm.14 The dataset was collected over a 17-month period using a fleet of 13 mobile manipulator robots from Everyday Robots (EDR) operating in real-world office and kitchen environments.16

This extensive data collection effort yielded approximately 130,000 successful task episodes, each annotated with a natural language instruction.16 The dataset was deliberately designed for breadth, encompassing over 700 distinct, high-level skills. These ranged from simple pick-and-place operations to more complex manipulation tasks like opening and closing drawers, retrieving items from within them, placing long objects upright, pulling napkins from a dispenser, and opening jars.16 This diversity was essential for testing the model’s ability to generalize by learning shared patterns across structurally similar tasks, rather than simply memorizing individual skills.

A pivotal element of the training strategy was the incorporation of data from a completely different robot embodiment. In addition to the EDR data, the researchers mixed in data from a fixed-base Kuka arm performing a simple bin-picking task.16 This “cross-embodiment” data mixing was a foundational experiment. It sought to answer a critical question: can a model learn from the experience of one type of robot to improve the performance of another? The results were positive. By training on both datasets, RT-1’s performance on the bin-picking task (which it had never seen performed by an EDR robot) nearly doubled, while its proficiency on the original 700+ EDR tasks was retained.16 This was more than just a clever data augmentation technique; it was a powerful validation of a core idea. It demonstrated that robot experience could be abstracted and transferred, suggesting the feasibility of a universal “language” for robotic actions. This experiment laid the groundwork for a future where data from hundreds of different robots worldwide could be aggregated into a single, massive dataset, directly addressing the data scarcity problem by allowing every robot to learn from the collective experience of all robots. This collaborative data strategy is now a central pillar of subsequent projects like the Open X-Embodiment dataset and the models trained upon it.

 

2.3. Performance and Generalization Capabilities

 

The ultimate measure of a robot foundation model is not just its ability to perform tasks it was trained on, but its ability to generalize to novel situations. In extensive real-world evaluations, RT-1 demonstrated a significant leap forward in this regard.

On the set of over 700 tasks seen during training, the model achieved a high success rate of 97%, confirming its capacity to absorb and reliably execute a wide range of skills.14 However, the more critical tests were those involving generalization. Compared to strong baseline models, RT-1 showed markedly superior performance when faced with new challenges:

  • Unseen Tasks: It was 25% more successful at executing novel combinations of known skills and objects.14
  • Environmental Robustness: It was 36% more successful when operating in cluttered environments with distractor objects and 18% more successful when faced with significant background changes, such as different kitchens with varied lighting.14

This high degree of robustness enabled RT-1 to be successfully integrated into more complex, long-horizon planning frameworks like SayCan, which uses a large language model to decompose high-level commands into a sequence of executable steps. In real kitchen environments, a SayCan system powered by RT-1 for manipulation significantly outperformed baselines, successfully completing task sequences with as many as 50 stages.14 This demonstrated that the model’s reliability and adaptability were sufficient to serve as a capable “motor controller” for a higher-level AI planner.

 

2.4. RT-1’s Contribution: A Robust, Task-Agnostic Control Model

 

RT-1’s primary contribution was to firmly establish the viability of the foundation model approach for real-world robotic control. It provided the first large-scale, empirical demonstration that a single, task-agnostic Transformer model, trained on a sufficiently large and diverse dataset of real-world interactions, could achieve a level of generalization and robustness that surpassed previous techniques.14 It proved that the principles that had led to breakthroughs in NLP and computer vision were not confined to the digital realm but could be successfully applied to the noisy, complex, and physically-grounded domain of robotics.15 By doing so, RT-1 created a new, scalable baseline and a clear architectural path for the development of more advanced, general-purpose robot policies.

 

Section 3: RT-2 – Fusing Web-Scale Knowledge with Physical Action

 

If RT-1 was a demonstration of scaling up learning from robotic data, its successor, Robotics Transformer 2 (RT-2), represented a profound conceptual leap: directly infusing a robot with the vast semantic knowledge of the internet. Announced by Google in July 2023, RT-2 moved beyond learning physical skills in isolation and instead leveraged a pre-existing, web-scale Vision-Language Model (VLM) as its core cognitive engine. This shift from a purpose-built robotics model to an adapted generalist AI model unlocked a new class of “emergent” capabilities, allowing the robot to reason about abstract concepts, understand novel commands, and perform rudimentary problem-solving in ways that were impossible with RT-1.18

 

3.1. The Vision-Language-Action (VLA) Model: A Conceptual Leap

 

RT-2 is defined as a Vision-Language-Action (VLA) model.19 This new architectural classification signifies a fundamental change in design philosophy. While RT-1 was a Transformer built from the ground up for robotics, RT-2 is an existing, powerful VLM—such as Google’s PaLM-E or PaLI-X—that has been adapted for robotic control.19 These VLMs are foundation models that have already been pre-trained on immense internet-scale datasets to perform tasks like visual question answering (VQA) and image captioning. They possess a rich, pre-existing understanding of the visual world and its connection to natural language.20

The central hypothesis behind the VLA approach is that this vast repository of semantic and common-sense knowledge, learned from billions of images and text documents, can be directly transferred to the domain of robotic control.18 Rather than teaching a robot what an “apple” is through thousands of physical examples, a VLA model can inherit this concept from its web-scale pre-training and then only needs to learn the physical actions associated with manipulating it from a much smaller robotics dataset.22 This approach represents a form of cognitive arbitrage: leveraging a massive, pre-existing asset (a trained VLM) to overcome the primary bottleneck in robotics (the scarcity of physical data).

 

3.2. The Key Innovation: Treating Actions as a Language

 

The technical brilliance of RT-2 lies in its simple yet powerful method for bridging the gap between the VLM’s world of text and images and the robot’s world of physical actions. The core innovation is to represent the robot’s motor commands as just another form of language.18

In this framework, the continuous, multi-dimensional action space of the robot (e.g., changes in arm position, gripper state) is discretized and mapped to a sequence of text tokens—effectively, a string of numbers that can be processed by a standard natural language tokenizer.19 For example, a specific arm movement might be represented by the string “1 128 91 241 5 101 127 217”.24

This elegant tokenization scheme reframes the entire robotics problem. The task is no longer to map pixels to motor torques but to perform a “translation” from an input “sentence” (composed of image tokens and a natural language command) to an output “sentence” (composed of action tokens).25 This allows the powerful, pre-trained VLM to be fine-tuned to “speak robot” without requiring any changes to its underlying architecture.20 The model learns to generate sequences of action tokens in the same way it learns to generate sequences of English words, making the integration seamless and highly scalable.
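The “actions as language” bridge can be illustrated in a few lines: once an action has been discretized, it is rendered as a plain string of integers that an ordinary text tokenizer can ingest, and parsed back into bins at execution time. The helper functions below are illustrative, not RT-2’s actual tokenizer integration; the string format simply mirrors the example quoted above.

```python
def action_to_text(bins):
    """Render discretized action bins as a plain-text 'sentence' of numbers."""
    return " ".join(str(int(b)) for b in bins)

def text_to_action(text):
    """Parse a generated action string back into integer bins."""
    return [int(tok) for tok in text.split()]

# The VLM is fine-tuned to emit strings like this in place of ordinary words.
action_string = action_to_text([1, 128, 91, 241, 5, 101, 127, 217])
print(action_string)                   # "1 128 91 241 5 101 127 217"
print(text_to_action(action_string))   # [1, 128, 91, 241, 5, 101, 127, 217]
```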

 

3.3. Training Methodology: Co-Fine-Tuning on Web and Robotic Data

 

To teach the VLM this new “language” of robot actions without causing it to forget its vast web-based knowledge—a problem known as “catastrophic forgetting”—RT-2 employs a strategy called co-fine-tuning.20 This is a hybrid approach that combines fine-tuning on new data with continued training on old data.

The VLM’s training set is augmented with the robotics trajectory data (the same dataset used to train RT-1), which consists of sequences of (image, language command, action) triplets.19 Simultaneously, the model continues to be trained on a portion of its original internet-scale vision-language dataset, such as VQA and image captioning tasks.18 This mixed-training regimen forces the model to learn the new action-generation task while simultaneously reinforcing and maintaining its existing semantic and visual capabilities.19 The result is a single, unified model that can function as a VLM (answering questions about an image) and a robot controller (executing a command based on an image), all within the same set of neural network weights.23
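The co-fine-tuning recipe can be sketched as a data-mixing loop in which every optimization step draws either a web vision-language batch or a robot trajectory batch, so the two objectives are interleaved throughout training. The sampling ratio, dataset interfaces, and loss method below are illustrative assumptions rather than the actual RT-2 training configuration.

```python
import random

def cofinetune(vlm, web_batches, robot_batches, optimizer, steps, robot_fraction=0.5):
    """Interleave web vision-language batches with robot action batches.

    Mixing both sources throughout training is what prevents the model from
    catastrophically forgetting its web-derived knowledge while it learns
    to emit action tokens.
    """
    for _ in range(steps):
        if random.random() < robot_fraction:
            batch = next(robot_batches)   # (image, instruction, action-token targets)
        else:
            batch = next(web_batches)     # (image, question/caption, text targets)

        # Both kinds of batch share the same next-token objective, since
        # actions are represented as ordinary text tokens (assumed interface).
        loss = vlm.next_token_loss(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```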

 

3.4. Emergent Intelligence: Semantic Reasoning, Symbol Understanding, and Chain-of-Thought

 

The most significant outcome of the RT-2 approach is the emergence of a wide range of intelligent behaviors that were not present in the robot’s training data and could not have been learned from it alone.18 These capabilities are a direct result of transferring knowledge from the web.

  • Semantic Generalization: RT-2 demonstrates an ability to understand abstract and semantic concepts. For instance, when commanded to “pick up the drink for someone who is tired,” the model can correctly identify and select an energy drink from a group of beverages, a piece of reasoning derived entirely from common-sense associations learned from web text, not from any robot demonstration.18 Similarly, it can understand the concept of an “improvised hammer” and select a rock to perform a hammering-like task, or identify “trash” in various forms and know to dispose of it, showcasing a level of conceptual understanding far beyond RT-1’s capabilities.18
  • Symbol Understanding: The model can ground abstract symbols in the physical world. In one remarkable demonstration, RT-2 is commanded to “move the banana to the sum of two plus one.” The robot correctly identifies the number ‘3’ on a piece of paper and places the banana on it.19 The robot was never trained on tasks involving arithmetic or placing objects on numbers; it learned to recognize numbers and perform basic math from its VLM pre-training and was able to connect that abstract knowledge to a physical action.
  • Chain-of-Thought (CoT) Reasoning: RT-2’s architecture as a language model allows it to leverage advanced prompting techniques like Chain of Thought. By instructing the model to first output a natural language plan for how it will accomplish a task before generating the action tokens, its reasoning process becomes more explicit and robust.18 For example, when asked to pick up an improvised hammer, the model might first output the text, “I need a tool to hammer the nail. A rock is heavy and hard, so it can be used as a hammer. I will pick up the rock,” before then generating the action tokens to execute that plan.23 This allows the model to break down complex, multi-stage commands and reason through them logically, a critical step towards more sophisticated problem-solving. This visually grounded planning capability also gives it an advantage over earlier systems like SayCan, which relied entirely on language and could not “see” the real world to inform its plans.19

 

Section 4: A Comparative Analysis: The Generational Leap from RT-1 to RT-2

 

The transition from RT-1 to RT-2 is not merely an incremental update but a generational leap that reflects a fundamental shift in the strategy for building general-purpose robots. While both models share the goal of creating adaptable, multi-task robotic systems, their underlying philosophies, architectures, and resulting capabilities are profoundly different. A direct comparison reveals the magnitude of the advancement and clarifies the specific advantages conferred by the Vision-Language-Action (VLA) approach.

 

4.1. Architectural Evolution and Design Philosophy

 

The core difference between the two models lies in their origin and design philosophy. RT-1 was a purpose-built robotics model, a high-capacity Transformer architecture designed and trained from the ground up specifically for the task of real-world robotic control.14 Its intelligence was derived entirely from the physical data it was shown—the 130,000 demonstrations of robotic tasks. The philosophy behind RT-1 was one of scaling up robotics-specific learning: the belief that with a large enough and diverse enough dataset of physical interactions, and a sufficiently powerful model, generalization would emerge.

In contrast, RT-2 represents a philosophy of knowledge transfer and cognitive leverage. It is not a new model built from scratch but an adaptation of a massive, pre-existing generalist AI—a Vision-Language Model.19 Its intelligence is a hybrid, combining the vast semantic and visual knowledge absorbed from web-scale data with the specific motor skills learned from the robotics dataset.28 The design philosophy is not to teach the robot everything about the world through physical experience, which is prohibitively slow and expensive, but to give it a massive head start by allowing it to inherit a foundational understanding of concepts, objects, and language from the internet.

 

4.2. Quantitative Performance Review: A Doubling of Generalization on Unseen Tasks

 

The most direct measure of the VLA approach’s success is its impact on performance, particularly on the critical metric of generalization to novel scenarios. While both models performed well on tasks they had seen during training, their capabilities diverged sharply when faced with the unknown.

RT-2 successfully retained the high performance of RT-1 on the original set of tasks from the robotics training data, demonstrating that the co-fine-tuning process did not degrade its ability to execute known skills.19 The true breakthrough, however, was in its zero-shot generalization capabilities. In extensive evaluations involving over 6,000 real-world trials, RT-2 demonstrated a success rate of 62% on previously unseen tasks and scenarios. This was a near-doubling of the 32% success rate achieved by RT-1 under similar conditions.19 Further analysis confirmed this leap, with studies showing an approximate 2x to 3x improvement in emergent skills and generalization when comparing RT-2 to baselines that included RT-1.23

Furthermore, experiments confirmed the importance of the underlying VLM’s scale. Ablation studies comparing RT-2 variants built on different VLM backbones (e.g., a 55B parameter PaLI-X vs. a 5B parameter model) showed a clear trend: the larger the pre-trained model, and thus the more web knowledge it contained, the better its generalization performance as a robot controller.23 This provides strong quantitative evidence that the semantic knowledge transferred from the web is directly responsible for the dramatic improvement in adaptability.

 

4.3. Qualitative Shift: From Pattern Matching to Semantic Understanding

 

Beyond the numbers, the leap from RT-1 to RT-2 represents a qualitative shift in the nature of the robot’s intelligence—a move from sophisticated pattern matching to a nascent form of semantic understanding.

RT-1 demonstrated powerful generalization through interpolation. It could learn the underlying structure of tasks and combine known primitives in novel ways. For example, if it was trained on “pick up the blue block and put it in the red bowl” and “pick up the green sponge and put it in the blue bowl,” it could likely generalize to “pick up the blue block and put it in the blue bowl”.16 It excelled at recombining elements it had already seen.

RT-2, by contrast, demonstrates generalization through extrapolation and reasoning. Its abilities are not confined to the concepts present in its robotics data. It can understand and act upon instructions involving objects, categories, and abstract ideas it has only ever encountered in text and images from the web.19 When RT-2 picks up an energy drink for a “tired person,” it is not matching a visual pattern; it is acting on a semantic association learned from its vast pre-training. This represents a fundamental shift from a robot that learns how to perform physical actions to a robot that also understands why it is performing them, at least at a conceptual level. This ability to connect language-based user intent to physical action, even for novel concepts, is the defining characteristic of the VLA paradigm and the core of the generational leap from RT-1.

The following table synthesizes these key distinctions, providing a clear, at-a-glance summary of the evolution from RT-1 to RT-2. This structured comparison highlights not just the technical changes but their direct impact on performance and capability, making the strategic significance of the VLA breakthrough immediately apparent.

 

| Feature | Robotics Transformer 1 (RT-1) | Robotics Transformer 2 (RT-2) | Significance of the Leap |
| --- | --- | --- | --- |
| Model Type | Robotics Transformer | Vision-Language-Action (VLA) Model | Shift from a purpose-built robotics model to leveraging a general-purpose VLM as the core “brain.” |
| Core Architecture | Transformer-based, trained end-to-end on robotic data 16 | Pre-trained VLM (PaLM-E, PaLI-X) backbone, co-fine-tuned for action generation 19 | Moves from learning from scratch to inheriting a massive base of knowledge. |
| Training Data | Large-scale real-world robot demonstrations (130k episodes) 16 | Co-fine-tuned on robot demonstrations AND internet-scale vision-language data 18 | The model’s knowledge base expands from just “what the robot has done” to “everything on the web.” |
| Key Innovation | Scalable architecture with tokenization of images and actions 16 | Representing robot actions as text tokens, treating control as a language problem 19 | This is the crucial bridge that allows web-scale knowledge to be directly translated into physical control commands. |
| Performance (Seen) | High proficiency (97% success on 700+ tasks) 14 | Retained high performance on original tasks seen in robot data 19 | The new approach does not sacrifice performance on known tasks. |
| Performance (Unseen) | 32% success rate on novel scenarios 19 | 62% success rate on novel scenarios (nearly 2x improvement) 19 | A dramatic, quantifiable improvement in the most critical metric for general-purpose robotics: adaptability to the unknown. |
| Emergent Capabilities | Generalization to new combinations of seen skills and objects 16 | Semantic reasoning, symbol understanding, chain-of-thought planning 18 | A qualitative shift from pattern recognition to genuine, albeit rudimentary, cognitive abilities. The robot can reason about things it has never seen. |

 

Section 5: The Competitive and Collaborative Research Landscape

 

While Google’s RT-1 and RT-2 models are landmark achievements, they exist within a vibrant and rapidly evolving global research ecosystem. Progress in robot foundation models is not occurring in a vacuum but is being driven by a dynamic interplay of competition and collaboration among industrial labs, open-source communities, and academic institutions. Understanding this broader landscape is essential for contextualizing the significance of the RT series and appreciating the diverse strategies being pursued to achieve general-purpose robotic intelligence.

 

5.1. Open-Source Counterparts: The Role of Octo and OpenVLA

 

The development of powerful, open-source alternatives to the closed models from large corporate labs is crucial for democratizing research, enabling reproducibility, and fostering community-driven innovation. Two models, in particular, have emerged as key players in this space.

  • Octo: Developed through a collaboration led by researchers at UC Berkeley, Octo is an open-source, transformer-based generalist robot policy.29 It is pre-trained on a massive dataset of 800,000 robot trajectories from the Open X-Embodiment project, a large-scale effort to aggregate robotics data from multiple institutions.29 Octo’s architecture is specifically designed for flexibility and efficient adaptation. It can be quickly fine-tuned to new robot embodiments, sensor configurations (e.g., different camera setups, force-torque sensors), and action spaces with relatively modest computational resources.29 In evaluations, Octo has demonstrated impressive out-of-the-box performance, outperforming the previous open-source state-of-the-art (RT-1-X) and showing capabilities competitive with the much larger, 55-billion parameter RT-2-X model, especially when instructed with natural language.29
  • OpenVLA: Building directly on the Vision-Language-Action concept pioneered by RT-2, OpenVLA is a powerful, 7-billion parameter open-source VLA model.33 It leverages a pre-trained Llama 2 language model as its backbone and incorporates a sophisticated fused visual encoder that combines features from two strong pre-trained vision models (DINOv2 and SigLIP).34 Trained on a curated dataset of 970,000 trajectories from the Open X-Embodiment collection, OpenVLA has set a new standard for open-source generalist manipulation. Remarkably, despite having seven times fewer parameters, OpenVLA has been shown to outperform the closed-source 55B-parameter RT-2-X in absolute task success rate across a wide range of evaluation tasks and robot embodiments.34 This result powerfully demonstrates that intelligent architectural choices and meticulous data curation can be more impactful than raw model scale alone.

 

5.2. Alternative Architectures: MIT’s Compositional (HiP) vs. Monolithic Models

 

A fundamental architectural debate is emerging in the field, questioning whether the best path forward is to build a single, massive, end-to-end model or to combine multiple, specialized models in a modular fashion. Google’s RT-2 and Physical Intelligence’s π0 represent the monolithic approach, betting on the power of end-to-end training to unlock emergent capabilities. A compelling alternative is the compositional approach, exemplified by the Hierarchical Planning (HiP) framework from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).35

The HiP framework decomposes the complex problem of long-horizon robot planning into three distinct stages, each handled by a separate, pre-existing foundation model 35:

  1. A Large Language Model acts as a symbolic reasoner, using its common-sense knowledge to break down a high-level goal (e.g., “make a cup of tea”) into an abstract sequence of steps.
  2. A Large Video Diffusion Model, trained on internet footage, acts as a “world model,” taking the abstract plan and grounding it in physical and geometric reality by generating a plausible sequence of visual observations.
  3. An Egocentric Action Model takes the generated visual plan and translates it into concrete, executable actions for the robot based on its first-person view of the environment. A simplified sketch of how these three stages can be chained appears after this list.
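In code, the compositional philosophy amounts to chaining three independently trained models through their inputs and outputs. The sketch below uses stub classes in place of the real foundation models; the interfaces and function names are illustrative assumptions, not the HiP implementation.

```python
from typing import List

# Stand-in stubs for the three pre-trained components; in HiP each would be a
# separate foundation model (an LLM, a video diffusion model, an action model).
class LanguageModelStub:
    def decompose(self, goal: str) -> List[str]:
        return [f"step {i + 1} of '{goal}'" for i in range(3)]

class VideoModelStub:
    def generate(self, subtask: str, view: str) -> List[str]:
        return [f"{subtask} / imagined frame {i}" for i in range(2)]

class ActionModelStub:
    def infer_action(self, view: str, next_view: str) -> str:
        return f"move toward: {next_view}"

def hierarchical_plan(goal: str, current_view: str,
                      llm=LanguageModelStub(),
                      video_model=VideoModelStub(),
                      action_model=ActionModelStub()) -> List[str]:
    """Illustrative wiring of a HiP-style compositional planner."""
    actions = []
    # Stage 1: the LLM decomposes the goal into abstract subtasks.
    for subtask in llm.decompose(goal):
        # Stage 2: the video model "imagines" plausible future observations.
        for frame in video_model.generate(subtask, current_view):
            # Stage 3: the action model turns each imagined transition into an
            # executable command, then planning continues from the imagined state.
            actions.append(action_model.infer_action(current_view, frame))
            current_view = frame
    return actions

print(hierarchical_plan("make a cup of tea", "kitchen camera view"))
```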

This compositional philosophy offers two key advantages over the monolithic approach. First, it obviates the need for large, expensive datasets of perfectly paired vision, language, and action data, as each component model can be trained on different, more readily available data modalities.35 Second, it makes the robot’s decision-making process more transparent and interpretable, as the reasoning at each level of the hierarchy can be inspected.35 This architectural divergence represents a strategic fork in the road for AI robotics. The success of monolithic models hinges on the continued scaling of data and computation, while the success of compositional models depends on developing robust interfaces for combining specialized AI agents. The outcome of this debate will shape the future architecture of all complex AI systems.

 

5.3. Commercial Ventures: The Ambitions of Physical Intelligence (π0)

 

The rapid progress in research has spurred significant commercial investment, with startups aiming to build the definitive foundation model for physical intelligence. A leading example is Physical Intelligence, co-founded by pioneering robotics researcher Sergey Levine. Their first prototype model, π0 (pi-zero), is a general-purpose VLA model that shares a similar philosophy with RT-2, starting from a pre-trained VLM and fine-tuning it on a large, multi-robot dataset.36

However, π0 introduces a key architectural distinction. While models like OpenVLA and RT-2 often rely on discretizing the action space into a finite set of tokens, π0 uses a diffusion-based technique called flow matching to generate continuous, high-frequency motor commands.37 The company argues that this approach is better suited for the kind of fluid, dexterous manipulation required for complex real-world tasks. In head-to-head evaluations on challenging, multi-stage tasks—such as folding a shirt or bagging groceries—π0 has been shown to dramatically outperform both OpenVLA and Octo.37 These results suggest that while semantic understanding from VLMs is critical, innovations in the action generation mechanism are equally important for unlocking advanced physical capabilities.
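For contrast with discrete action tokens, the sketch below shows the core of a generic conditional flow-matching objective for continuous actions: a network is trained to predict the velocity that transports random noise along a straight path toward a demonstrated action, conditioned on the observation. This is a textbook-style formulation under illustrative assumptions (network shape, dimensions), not π0’s implementation.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Predicts a velocity for the noisy action given time t and an observation embedding."""
    def __init__(self, action_dim=7, obs_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, t, obs):
        return self.net(torch.cat([noisy_action, obs, t], dim=-1))

def flow_matching_loss(model, actions, obs):
    """One training step of (rectified) flow matching on demonstrated actions."""
    noise = torch.randn_like(actions)             # source distribution
    t = torch.rand(actions.shape[0], 1)           # random interpolation time
    x_t = (1 - t) * noise + t * actions           # point on the straight path
    target_velocity = actions - noise             # constant velocity of that path
    pred = model(x_t, t, obs)
    return ((pred - target_velocity) ** 2).mean()

# Toy usage: 16 demonstrated 7-D action vectors with 32-D observation embeddings.
model = VelocityField()
loss = flow_matching_loss(model, torch.randn(16, 7), torch.randn(16, 32))
loss.backward()
print(loss.item())
```

At inference time, such a model would integrate the learned velocity field from noise toward an action, which is what allows smooth, high-frequency continuous commands rather than a fixed grid of 256 bins.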

 

5.4. Academic Influence: Contributions from Stanford and Carnegie Mellon University

 

The conceptual and theoretical underpinnings of this field are heavily influenced by leading academic institutions.

  • Stanford University’s Center for Research on Foundation Models (CRFM) has played a pivotal role in shaping the discourse. By coining the term “foundation model,” the center provided a unifying conceptual framework that connected disparate research threads in NLP, vision, and other domains.6 Their influential reports have articulated the core principles, opportunities, and risks of this new paradigm, guiding both technical research and policy discussions.4
  • Carnegie Mellon University’s Robotics Institute (RI), a historic leader in robotics, is focused on the practical challenges of integrating these powerful models into robust robotic systems. Research at CMU explores how foundation models can be incorporated into classic robotics pipelines to improve task specification and scene modeling.39 Critically, CMU researchers also emphasize areas that vision-centric models often overlook, such as the indispensable role of tactile sensing for dexterous manipulation. They argue that a true understanding of physical interaction requires grounding in touch, not just vision, a key missing piece in many current foundation models.40

 

Section 6: Critical Challenges on the Path to Deployment

 

Despite the remarkable progress demonstrated by models like RT-2 and its contemporaries, the path from research breakthrough to widespread, reliable deployment of general-purpose robots is fraught with significant challenges. Overcoming these technical, safety, and ethical hurdles is the central focus of current and future research in the field. These are not minor engineering problems but deep, fundamental questions that must be addressed to unlock the full potential of this technology responsibly.

 

6.1. Technical Hurdles: Data, Sim-to-Real, and Real-Time Inference

 

While the VLA paradigm has cleverly reduced how much common-sense knowledge must be learned from robotics data alone, a new set of technical bottlenecks has come into focus.

  • Data Scarcity and Quality: The problem of data has not been solved; it has shifted. The performance of a robot on physical tasks remains fundamentally limited by the distribution of skills present in its embodied training data.7 While web data provides the “what” and “why,” the robot still needs to learn the “how” from physical examples. The new bottleneck is therefore not just the quantity of data, but its diversity and quality. This includes capturing a wider range of dexterous manipulation skills and, crucially, incorporating richer sensory modalities. Current models are overwhelmingly reliant on vision and language. A robust understanding of the physical world requires integrating other essential data streams like tactile and force feedback, which are completely absent from web data and provide critical information about contact, friction, and object properties.13
  • Sim-to-Real Transfer: The gap between simulation and reality persists as a major impediment to scalable training.7 While simulators are improving, they still struggle to model the full complexity of real-world physics and sensor data, limiting the effectiveness of policies trained purely in simulation.
  • Real-Time Performance: The computational demands of large foundation models pose a critical bottleneck for real-time robotic control. The high inference latency of these models makes it difficult to achieve the high-frequency control loops necessary for dynamic and reactive tasks.8 This challenge is particularly acute for mobile robots that operate under strict Size, Weight, and Power (SWaP) constraints, which cannot accommodate large, power-hungry GPUs. Research into techniques like model distillation—training smaller, more efficient models to mimic the behavior of larger ones—is essential for deploying this intelligence on field-deployable platforms.41 A minimal sketch of the standard distillation objective follows this list.
  • Multimodal Integration: The seamless integration of diverse sensory modalities remains an open research problem. Unifying sight, sound, touch, and proprioception into a coherent representation of the world is a key step towards more human-like physical intelligence, but current architectures are still in the early stages of tackling this complexity.8
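Returning to the real-time performance point above, model distillation is commonly implemented as a temperature-scaled KL objective that pushes a small student’s action-token distribution toward the large teacher’s. The sketch below is a generic formulation with illustrative shapes and temperature, not any specific deployed system.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student action-token distributions."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

# Toy usage: logits over 256 action bins for a batch of 8 observations.
teacher_logits = torch.randn(8, 256)                       # large foundation model
student_logits = torch.randn(8, 256, requires_grad=True)   # small onboard policy
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```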

 

6.2. Safety and Reliability: Ensuring Predictable Behavior in Unstructured Environments

 

For a general-purpose robot to be trusted in human environments, it must be safe and reliable. The very nature of foundation models, however, introduces a fundamental tension between generality and safety. The goal of a general-purpose system is to adapt and behave intelligently in novel situations for which it was not explicitly trained. Yet, safety engineering has traditionally relied on formal verification and exhaustive testing of a system’s behavior within a well-defined operational domain. One cannot exhaustively test a system whose primary feature is its ability to generate novel behaviors.

This dilemma highlights several critical areas of research:

  • Uncertainty Quantification: A safe robot must “know what it doesn’t know.” Models must be able to accurately quantify their own uncertainty when faced with unfamiliar inputs or situations, allowing them to fail gracefully, ask for help, or refuse to perform a task rather than executing a potentially dangerous action.8 A simple illustration of such an uncertainty gate follows this list.
  • Robustness and Corner Cases: The real world is characterized by a “long tail” of rare and unexpected events. Ensuring that a robot behaves predictably and safely when confronted with these corner cases—an object slipping, a person suddenly entering its workspace—is perhaps the most difficult challenge in robotics.7
  • Human Oversight and Intervention: In the foreseeable future, robust human-in-the-loop systems will be essential for safe deployment. These systems must allow for effective monitoring, timely intervention, and intuitive methods for correcting robot behavior, ensuring that ultimate control remains in human hands.13 This requires a paradigm shift in safety engineering, moving from a focus on pre-deployment verification to one centered on runtime monitoring, anomaly detection, and the development of “ethical governors” that can constrain the actions of a powerful but not fully predictable intelligence.
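As a concrete, deliberately simplistic illustration of the uncertainty-quantification point above, the sketch below gates execution on the entropy of a policy’s per-dimension action-token distributions: high entropy triggers a request for human help instead of an action. The threshold, tensor shapes, and decision rule are illustrative assumptions, not a validated safety mechanism.

```python
import torch

def should_defer(action_logits, entropy_threshold=4.0):
    """Return True if the policy's action distribution is too uncertain to act on.

    action_logits: tensor of shape (action_dims, num_bins) of per-dimension logits.
    A high-entropy distribution suggests unfamiliar inputs, so the robot should
    pause, ask for help, or refuse rather than execute a possibly unsafe action.
    """
    probs = torch.softmax(action_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)  # per-dimension entropy (nats)
    return bool(entropy.mean() > entropy_threshold)

# Toy usage: 11 action dimensions, 256 bins each.
logits = torch.randn(11, 256)
if should_defer(logits):
    print("Uncertain: requesting human confirmation before acting.")
else:
    print("Confident: executing predicted action.")
```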

 

6.3. Ethical and Societal Implications: Bias, Accountability, and Workforce Transformation

 

The deployment of autonomous, general-purpose robots into society raises profound ethical questions that extend far beyond technical implementation. These systems will interact with people, make decisions that affect them, and reshape economic and social structures.

  • Algorithmic Bias: Foundation models trained on vast corpuses of internet data are known to absorb and potentially amplify the societal biases present in that data. A robot powered by such a model could manifest these biases in its physical interactions or decisions, leading to discriminatory or stereotypical behavior based on gender, race, or culture.42
  • Accountability and Liability: When an autonomous robot causes harm, determining responsibility becomes a complex legal and ethical puzzle. Is the manufacturer of the hardware liable? The developer of the foundation model? The company that fine-tuned the model for a specific application? Or the end-user who gave the command? Current legal frameworks for liability are not designed for systems that learn and adapt, and establishing clear lines of accountability is a critical prerequisite for deployment.45
  • Privacy and Surveillance: Robots operating in homes, hospitals, and public spaces are effectively mobile sensor platforms, equipped with cameras, microphones, and other sensors capable of collecting vast amounts of sensitive data. This creates significant risks for individual privacy and raises concerns about the potential for pervasive surveillance.43
  • Workforce Displacement: A primary economic driver for general-purpose robotics is the automation of physical labor. While this promises increased productivity and the elimination of dangerous or tedious jobs, it also raises significant concerns about large-scale workforce displacement, economic inequality, and the need for massive societal investment in retraining and education programs.42
  • Human-Robot Interaction and Emotional Attachment: As robots become more sophisticated and capable of natural interaction, particularly in roles like elder care or companionship, there is a risk of users, especially vulnerable ones, forming unhealthy emotional dependencies. Designing these interactions ethically requires careful consideration of the psychological impact on humans and maintaining transparency about the robot’s nature as a machine.45

 

Section 7: Strategic Implications and Future Outlook

 

The rapid advancements in robot foundation models, exemplified by the trajectory from RT-1 to RT-2, are not just academic curiosities; they are harbingers of a profound technological and economic shift. These models are the enabling technology that could finally move general-purpose robots from the realm of science fiction into the fabric of our daily lives. The strategic implications for industry, research, and society are immense, and the future trajectory of this technology is beginning to take shape.

 

7.1. The Trajectory of General-Purpose Robotics: From Manipulators to Humanoids

 

The current generation of leading robot foundation models—including the RT series, Octo, and π0—is primarily focused on and demonstrated with robotic manipulation arms. This is a logical starting point, as manipulators are prevalent in industrial settings and provide a constrained yet complex environment for developing core capabilities. However, the ultimate vision for a truly general-purpose robot that can operate seamlessly in human-centric environments is increasingly converging on the humanoid form factor.49

Our world—from our tools and doorways to our countertops and vehicles—is designed for the human body. A robot with a humanoid morphology, including legs for ambulation and dexterous hands for manipulation, would not require us to redesign our environment to accommodate it.49 The recent surge in investment and high-profile development in humanoid robots by companies like Figure AI, Sanctuary AI, and Tesla is directly enabled by the progress in foundation models. These advanced AI systems provide the “brain” that is necessary to make a complex humanoid body useful and adaptable. Without a general-purpose intelligence to control them, these sophisticated machines would remain little more than teleoperated puppets.

This does not mean the future belongs exclusively to humanoids. A robust debate continues regarding the merits of general-purpose platforms versus specialized, purpose-built robots. Proponents of the latter argue that for many tasks, such as last-mile delivery or warehouse logistics, a simpler, more efficient, and more reliable robot designed for a specific function will always outperform a complex, generalist humanoid.50 The future of robotics will likely not be a monoculture but a diverse ecosystem containing both highly specialized automated systems and increasingly capable general-purpose humanoids, each occupying the niches for which they are best suited.

 

7.2. Investment and Commercialization Pathways

 

The tangible progress in robot foundation models has catalyzed a significant influx of capital into the sector. In 2025 alone, over $2.2 billion was invested into startups focused on this technology, signaling strong confidence from the investment community in its commercial potential.51

The initial and most immediate commercialization pathways are in industrial and enterprise domains where a clear business case for automation exists:

  • Logistics and Warehousing: Automating tasks like picking, packing, sorting, and palletizing to address labor shortages, increase efficiency, and handle the ever-growing volume of e-commerce.52
  • Manufacturing: Performing complex assembly, machine tending, and quality inspection tasks that require more adaptability and dexterity than traditional industrial robots can provide.53

Looking further ahead, the applications will expand into service industries and eventually the consumer market. Key future growth areas include healthcare and elder care, where robots can assist with patient mobility, monitoring, and daily tasks; retail, for stocking shelves and customer assistance; and ultimately, the home, where a general-purpose robot could perform household chores like laundry, cleaning, and cooking.49

A critical analysis of this market trajectory suggests the emergence of a new business model: the “Robotic Brain as a Service” (RBaaS). The immense cost, computational power, and data required to develop and continuously improve a state-of-the-art robot foundation model create an extremely high barrier to entry.6 This market structure naturally favors a consolidation around a few major providers who can afford this investment. These companies are unlikely to also become experts in manufacturing every type of robotic hardware. This leads to a logical decoupling of the robot’s “brain” (the AI model) from its “body” (the physical hardware). In this future, a hardware manufacturer like Boston Dynamics or Figure AI might sell a physical humanoid, while the customer licenses the AI operating system from a company like Google (“Powered by Gemini Robotics”) or Physical Intelligence. This platform-based approach would mirror the evolution of the personal computer and smartphone markets (Hardware + OS), accelerating adoption, creating industry standards, and making the AI model, not the hardware, the primary source of value and competitive differentiation.

 

7.3. Recommendations for Future Research and Development Priorities

 

To continue the current pace of innovation and responsibly navigate the challenges ahead, the research and development community should prioritize several key areas:

  • Data Ecosystems: The most critical need is for larger, more diverse, and more accessible datasets of physical interaction. This requires a concerted effort to foster collaborative, open data-sharing initiatives like the Open X-Embodiment project. Future datasets must move beyond just vision and language to include rich, synchronized tactile, force, and audio data, which are essential for dexterous manipulation and robust physical understanding.
  • Architectural Synthesis: The debate between monolithic and compositional architectures is a fruitful area for research. Future breakthroughs may lie in hybrid systems that combine the strengths of both: the powerful emergent reasoning of large, end-to-end trained models with the transparency, modularity, and reliability of compositional approaches.
  • Safety, Alignment, and Trust: Research into safety must become a first-order priority, not an afterthought. This includes developing new techniques for robust uncertainty quantification, formal verification of learned policies, and effective human-in-the-loop oversight specifically for embodied agents whose actions have real-world consequences.
  • Hardware-Software Co-design: The next generation of robotic capabilities will be unlocked by a tighter integration between hardware and software. Foundation models should not just be retrofitted onto existing robots; instead, new robot hands, sensors, and actuators should be designed in tandem with the AI models that will control them, creating a virtuous cycle of co-optimization that enhances overall system performance.

By focusing on these priorities, the field can continue its rapid progress towards the long-standing goal of artificial intelligence: creating truly intelligent machines that can perceive, understand, and act capably and helpfully in the physical world.