
Robotic computing system and embodied AI evolution: an algorithm-hardware co-design perspective

Published: 10/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study examines the evolution of robotic computing systems and embodied AI, proposing an algorithm-hardware co-design perspective to address challenges in real-time performance and energy efficiency, and highlighting the limitations of existing hardware in meeting the demands of advanced motion planning and control.

Abstract

Abstract information is missing from the provided PDF first-page text.

In-depth Reading

1. Bibliographic Information

1.1. Title

Robotic computing system and embodied AI evolution: an algorithm-hardware co-design perspective

1.2. Authors

The authors of this paper are:

  • Longke Yan

  • Xin Zhao

  • Bohan Yang

  • Yongkun Wu

  • Guangnan Dai

  • Jiancong Li

  • Chi-Ying Tsui

  • Kwang-Ting Cheng

  • Yihan Zhang

  • Fengbin Tu

    Most authors are affiliated with the AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong, China. Longke Yan is pursuing a Ph.D. at the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China, with research interests including computer architecture, VLSI design, embodied AI, and algorithm-hardware co-design. Fengbin Tu, an Assistant Professor at the Department of Electronic and Computer Engineering and Associate Director of the Institute of Integrated Circuits and Systems at The Hong Kong University of Science and Technology, has extensive experience in AI chips, computing-in-memory, computer architecture, and reconfigurable computing.

1.3. Journal/Conference

This paper is published in the Journal of Semiconductors. This journal is a peer-reviewed publication focusing on the latest advancements in semiconductor science and technology, indicating that the paper has undergone a review process by experts in the field. Its publication in this journal suggests its relevance and contribution to the fields of semiconductor technology, computing systems, and AI hardware.

1.4. Publication Year

2025

1.5. Abstract

The abstract information is missing from the provided PDF first-page text. However, based on the paper's content, the abstract would likely introduce the rapid evolution of robotics and embodied AI, highlighting the increasing demands on computing systems. It would then propose algorithm-hardware co-design as a critical methodology to balance performance metrics like accuracy, latency, and power consumption. The paper would outline a roadmap from traditional robotics to hierarchical and end-to-end embodied AI models, discussing associated algorithmic advancements and specialized hardware accelerators. Finally, it would likely conclude by identifying future challenges and research opportunities for next-generation robotic computing systems through cross-stack co-design.

1.6. Original Source Link

/files/papers/69293dc1ba903910b6a9733b/paper.pdf (an internal relative link, not a full external URL). Based on the DOI provided in the paper, the official online version is https://doi.org/10.1088/1674-4926/25020034. The paper is officially published.

2. Executive Summary

2.1. Background & Motivation

The field of robotics has made significant progress, enabling machines to perform complex tasks autonomously or collaboratively. A typical robotic system consists of a physical body and an intelligent agent (brain) comprising sensory, computing, and actuation subsystems. The computing system, integrating intelligent algorithms and supporting hardware, is pivotal for functionality and performance.

Recent advancements in robotic algorithms, particularly in areas like computer vision, motion planning (e.g., RRT*), and control (e.g., MPC), have dramatically improved intelligence and accuracy. However, these advanced algorithms, especially with the rise of embodied AI (where robots leverage data-driven AI models like Transformers and diffusion models), place immense computational demands on hardware. General-purpose computing hardware like CPUs and GPUs often struggles to meet the stringent real-time, energy-efficiency, and power-constrained requirements of robotic systems, leading to a significant performance-efficiency gap. For example, RRT* on CPUs can take seconds, while GPUs consume hundreds of watts, making them impractical for mobile robots or drones. This gap is exacerbated by large-scale embodied AI models with billions of parameters.

The core problem the paper aims to solve is this performance-efficiency gap in robotic computing systems, especially for embodied AI, due to the mismatch between algorithm demands and general-purpose hardware capabilities. This problem is crucial for enabling the widespread deployment of intelligent, autonomous robots in various constrained environments (e.g., mobile robots, drones, manufacturing floors). The paper's entry point is algorithm-hardware co-design, a methodology that analyzes algorithm computational behaviors on hardware and optimizes them through both algorithm refinement and specialized hardware innovation to achieve substantial system-wide benefits.

2.2. Main Contributions / Findings

As a comprehensive survey paper, its main contributions are:

  • Comprehensive Overview: It provides a systematic and detailed overview of robotic computing systems from both algorithmic and hardware perspectives, covering traditional robotics and the emerging embodied AI.

  • Evolutionary Roadmap: It proposes a clear three-step roadmap for the evolution of embodied AI, moving from traditional robotics (model-driven) to hierarchical models (cognitive planning + action execution) and ultimately to end-to-end models (vision-language-action). This roadmap helps contextualize current and future research.

  • Algorithm-Hardware Co-design Emphasis: It rigorously highlights and demonstrates the critical role of algorithm-hardware co-design as the primary methodology to address the performance-efficiency gap. It illustrates how inherent properties in robotics algorithms (e.g., parallelism, locality, sparsity, similarity) create opportunities for co-design, leading to enhanced performance and efficiency.

  • Detailed Survey of Co-design Works: It extensively reviews recent works on both traditional robotic algorithms (perception, planning, action mapping, control) and embodied AI algorithms (Transformers, 3D reconstruction, diffusion models), showcasing specific examples of algorithm-hardware co-design techniques and their benefits.

  • Future Challenges and Opportunities: It identifies two primary challenges for embodied AI computing systems: (1) adapting computing platforms to the rapid evolution of embodied AI algorithms, and (2) transforming the potential of emerging hardware innovations (e.g., 3D IC, Chiplet, CIM) into end-to-end inference improvements. It then proposes a three-layer technology stack (software toolchain, hardware architecture, algorithms) and corresponding opportunities for future research.

    The key conclusion is that algorithm-hardware co-design is indispensable for realizing real-time, energy-efficient, and intelligent robotic computing systems and embodied AI, especially at the edge. The paper's findings help researchers understand the current state, major challenges, and promising directions for developing next-generation robotic platforms.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following foundational concepts:

  • Robotic System Architecture:

    • Physical Entity ("Body"): The physical platform of a robot, which interacts with the physical world. Examples include manipulator arms, humanoid robots, quadrupeds, or drones.
    • Intelligent Agent ("Brain"): The computational and control core of the robot. It comprises three fundamental subsystems:
      • Sensory System: Collects data about the robot's internal state and the external environment using sensors like cameras, LiDARs (Light Detection and Ranging, a remote sensing method that uses pulsed laser to measure ranges), etc.
      • Computing System: The focus of this paper, it integrates intelligent robotic algorithms and supporting hardware to process sensory data and dictate robotic actions.
      • Actuation System: Enables the robot's movement through components like servomotors, drives, and transmissions.
    • Degrees of Freedom (DOF): The number of independent parameters that define the configuration of a mechanical system. For robots, DOF often refers to the number of joints that can move independently.
  • Algorithm-Hardware Co-design: A methodology that involves analyzing the computational characteristics of an algorithm (e.g., parallelism, data locality, sparsity, data similarity) and concurrently optimizing both the algorithm and the underlying hardware to achieve superior system-wide performance, power efficiency, and cost. It's an iterative process where algorithm optimizations inform hardware design, and hardware capabilities influence algorithm development.

  • Artificial Intelligence (AI) / Machine Learning (ML) Models:

    • Convolutional Neural Networks (CNNs): A class of deep neural networks predominantly used for analyzing visual imagery. They employ convolutional layers (mathematical operations that apply a filter to an input to produce a feature map), pooling layers (down-sampling operations), and fully connected layers to learn hierarchical features.
    • Transformers: A neural network architecture introduced in 2017, primarily designed for sequence-to-sequence tasks but now widely applied to various domains including vision and robotics. Its key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.
    • Diffusion Models: A class of generative models that learn to reverse a gradual diffusion process (adding noise to data) to generate new data samples. They are powerful for generating high-quality images and are increasingly used for sequence generation tasks like robot actions.
    • Large Language Models (LLMs): Very large neural networks trained on massive text datasets, capable of understanding, generating, and processing human language. Examples include ChatGPT, PaLM.
    • Vision-Language Models (VLMs): Multimodal AI models that can process and understand information from both visual (images, video) and textual inputs. They can perform tasks like image captioning, visual question answering, and, in embodied AI, understand commands and perceive environments.
  • Hardware Technologies:

    • Central Processing Unit (CPU): The "brain" of a computer, responsible for executing instructions and performing calculations. General-purpose but can be slow for highly parallel tasks.
    • Graphics Processing Unit (GPU): A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images. Highly parallel architecture, excellent for deep learning computations, but can be power-intensive.
    • Field-Programmable Gate Array (FPGA): An integrated circuit that can be configured by a customer or designer after manufacturing. Offers high flexibility and energy efficiency for specific tasks, bridging the gap between general-purpose CPUs/GPUs and fixed-function ASICs.
    • Application-Specific Integrated Circuit (ASIC): An integrated circuit customized for a particular use rather than general-purpose use. Offers the highest performance and energy efficiency for its specific task but lacks flexibility.
    • System-on-Chip (SoC): An integrated circuit that integrates all components of a computer or other electronic system into a single chip. It typically includes CPU, GPU, memory, input/output ports, and specialized accelerators.
    • Compute-in-Memory (CIM): An emerging hardware paradigm that aims to overcome the Von Neumann bottleneck (the bottleneck between the processor and memory) by performing computation directly within or very close to memory, reducing data movement and improving energy efficiency.
    • 3D Integrated Circuit (3D IC): Stacking multiple silicon wafers or dies vertically and interconnecting them with Through-Silicon Vias (TSVs). This technology offers higher integration density, shorter interconnects, and higher bandwidth.
    • Chiplet: An approach to chip design where multiple small, specialized chips (chiplets) are integrated into a single package, connected by high-speed links. This allows for modular design, mix-and-match functionality, and potentially larger, more powerful systems than monolithic chips.

3.2. Previous Works

The paper extensively references prior studies across both traditional robotics and embodied AI, highlighting the progression towards more intelligent and efficient systems.

  • Traditional Robotics Algorithms:

    • Perception:
      • Detection: RCNN series (two-stage, CNN-based) and YOLO series (single-stage, faster).
      • Segmentation: UNet (upsampling for map restoration) and DeepLab (dilated convolutions for multi-scale context).
      • Depth Estimation: Local methods (e.g., SAD, Sum of Absolute Differences) and Global methods (e.g., Semi-Global Matching (SGM)) for stereo matching.
      • Localization: Simultaneous Localization and Mapping (SLAM) (constructs maps while localizing) and Visual-Inertial Odometry (VIO) (calculates relative poses using Kalman filtering).
    • Task Planning: Finite State Machines (FSMs) (for structured environments, e.g., DARPA Urban Challenge) and Partially Observable Markov Decision Process (POMDP) (for uncertainty, e.g., Hidden Markov Model (HMM) for intention prediction).
    • Motion Planning:
      • Graph-search-based: Dijkstra and A* (heuristic extension).
      • Sampling-based: Probabilistic Roadmap Method (PRM) and Rapidly-exploring Random Tree (RRT) and its variants like RRT* (outstanding efficiency in large environments).
    • Action Mapping:
      • Kinematics: Forward Kinematics (FK) and Inverse Kinematics (IK) (analytic methods, numerical methods like Jacobian inverse).
      • Dynamics: Inverse and Forward Dynamics (crucial for MPC, often using templating and code generation).
    • Control:
      • Feedback control without prediction: Proportional-Integral-Derivative (PID) control (simple, efficient).
      • Feedback control with prediction: Model Predictive Control (MPC) (predicts future behavior, optimizes control sequence).
      • Multi-task control: Whole-body control (closed-form methods, optimization methods like quadratic programming).
  • Embodied AI Algorithms:

    • Cognitive Planning Models (leveraging LLMs/VLMs):
      • ChatGPT for Robotics (Microsoft): Integrates ChatGPT with robotic function libraries for natural language task decomposition.
      • Inner Monologue: Combines LLM-based planning with real-time environmental feedback for dynamic reasoning.
      • PaLM-E (Google): A VLM (540-billion-parameter PaLM + 22-billion-parameter ViT) trained end-to-end on robotic data for multimodal reasoning and long-horizon planning.
    • Action Execution Models (specialized for low-level control):
      • Pose Prediction (often involving 3D Reconstruction):
        • PointNet: Foundational for point cloud processing.
        • ASGrasp (Samsung): Utilizes multi-level convolutional GRU network for transparent object reconstruction.
        • GraspNeRF and YOSO: Examples of NeRF-based and Transformer-based grasping models.
        • GaussianGrasper: Employs 3D Gaussian Splatting (3D GS) for scene reconstruction.
      • Action Generation (direct action output):
        • Autoregression-based (Transformers): RT-1 (Google, 35M parameters, EfficientNet-B3 backbone, action tokenization), ALOHA (Transformer encoder-decoder, 80M parameters, ResNet18 backbones).
        • Diffusion-based: Diffusion Policy (first to apply DDPMs to robot action space), UniP, AvDC (conditional diffusion for video generation), RDT-1B (diffusion foundation model, Transformer architecture).
    • Hierarchical Frameworks (combining cognitive planning and action execution):
      • SkillDiffuser: Uses GPT-2 for skill abstraction and diffusion model for trajectory generation.
      • $\pi_0$ (Physical Intelligence): Pre-trained VLM for cognitive planning, diffusion-variant flow-matching model for action execution.
      • GO-1 (AgiBot): Pre-trained VLM, latent planner, action expert.
      • GR00T N1 (NVIDIA): VLM-based System 2, Diffusion Transformer Module-based System 1.
      • Helix (Figure): 7B-VLM-based System 2, 80M-Transformer-based System 1.
    • End-to-End Models (integrating all into one VLA):
      • RT-2 (Google): Integrates web-scale vision-language pretraining into robotic action generation (ViT to project images into language token space).
      • RT-X: Builds upon RT-2 using the Open X-Embodiment dataset for generalization.
      • OpenVLA (Stanford): State-of-the-art, open-source VLA (Prismatic-7B VLM backbone: ViT visual encoder, MLP projector, Llama2 backbone).
  • Core Formulas (for prerequisite knowledge, especially Transformers): While the paper itself is a survey and does not introduce new formulas, understanding the self-attention mechanism is crucial for Transformers, which are foundational for many embodied AI models. The original attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequences, typically derived from the same input embedding through linear transformations.
    • $K^T$ is the transpose of the Key matrix.
    • $d_k$ is the dimension of the Key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor that prevents the dot products from growing too large, which would otherwise push the softmax function into regions with very small gradients.
    • $\frac{QK^T}{\sqrt{d_k}}$ computes the scaled similarity scores between Queries and Keys.
    • $\mathrm{softmax}(\cdot)$ normalizes the scores into a probability distribution, indicating how much attention each Value should receive.
    • Multiplying by $V$ produces the output, where each output element is a weighted sum of the Value vectors, with weights determined by the softmax scores.
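To make the mechanism concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (our own illustration; the shapes and variable names are assumptions, not from the paper). Note that the $QK^T$ score matrix grows quadratically with sequence length, which is precisely the bottleneck targeted by the Transformer accelerators surveyed later in this analysis.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of Value vectors

# Toy example: 4 query tokens attending over 6 key/value tokens, d_k = 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```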

3.3. Technological Evolution

The paper illustrates a clear technological evolution in robotic computing systems, primarily driven by advancements in AI and hardware capabilities.

  1. Traditional Robotics (Pre-2022, Model-Driven):

    • Algorithms: Characterized by explicit, model-driven approaches. Perception relied on rule-based computer vision or early CNNs. Planning used methods like A* and RRT*. Control used PID or basic MPC. These methods often required precise models of the robot and environment, limiting their generality.
    • Hardware: Primarily relied on general-purpose CPUs and GPUs, with some early FPGA or ASIC acceleration for bottlenecks. The focus was on optimizing specific, well-defined algorithms.
  2. Emergence of Embodied AI - Hierarchical Models (2022-2023):

    • Motivation: Breakthroughs in Transformers and diffusion models (for long sequence/high-dimensional data) and LLMs/VLMs (for natural language understanding, multimodal perception, and long-horizon planning).
    • Architecture: A shift towards hierarchical structures.
      • Higher Level: Cognitive planning models (e.g., LLMs/VLMs like ChatGPT for Robotics, PaLM-E) handle high-level reasoning, task decomposition, and long-horizon planning based on natural language instructions and multimodal perception. These models are often large.
      • Lower Level: Action execution models (e.g., Transformer-based RT-1, ALOHA, or diffusion model-based Diffusion Policy) handle dexterous, low-level action control with faster inference speeds, trained on specific embodiment data.
    • Hardware Implications: Increased demand for specialized accelerators for Transformers and diffusion models, as well as efficient handling of the communication and frequency mismatch between high-level (slower, complex) and low-level (faster, real-time) components (e.g., Helix's hardware-aware training).
  3. Future of Embodied AI - End-to-End Models (2023-Present & Future):

    • Motivation: The ultimate goal is to integrate perception, planning, and action into a single, unified model for general-purpose robots and Artificial General Intelligence (AGI).

    • Architecture: Vision-Language-Action (VLA) models (e.g., RT-2, OpenVLA) that directly map multimodal inputs (vision, language) to robotic actions. These models are typically large, often built on pre-trained VLMs fine-tuned with robotic data.

    • Hardware Implications: Extreme computational and data throughput demands, pushing the limits of existing hardware and necessitating advanced algorithm-hardware co-design strategies, including emerging technologies like CIM, 3D IC, and Chiplets, to achieve real-time, energy-efficient performance at the edge.

      The paper fits into this timeline by surveying the current state of co-design efforts at each stage, identifying the transition points, and forecasting future directions.

3.4. Differentiation Analysis

As a survey paper, its core differentiation lies not in proposing a new method, but in its comprehensive and unique perspective on the intersection of robotics, AI, and hardware.

  • Algorithm-Hardware Co-design Focus: Unlike many surveys that focus solely on robotic algorithms or hardware advancements, this paper explicitly and consistently frames the discussion through the lens of algorithm-hardware co-design. It meticulously connects algorithmic properties (parallelism, sparsity, locality, similarity) to opportunities for hardware innovation, and vice-versa. This integrated perspective is crucial for understanding the true challenges and solutions in high-performance, energy-efficient robotics.

  • Evolutionary Roadmap for Embodied AI: The paper provides a structured roadmap for the evolution from traditional robotics to hierarchical embodied AI and end-to-end embodied AI. This framework helps organize the vast landscape of embodied AI research and clarifies the progression of complexity in both algorithms and hardware demands.

  • Bridging Traditional and Embodied AI: The survey effectively bridges the gap between traditional robotics techniques (e.g., RRT*, MPC, SLAM) and the cutting edge of embodied AI (Transformers, diffusion models, LLMs). It shows how co-design principles apply to both, and how the demands of the latter significantly amplify the need for specialized hardware.

  • Detailed Hardware Acceleration Examples: For each category of algorithms (perception, planning, action, control, and embodied AI components like Transformers, 3D reconstruction, diffusion models), the paper provides concrete examples of hardware acceleration techniques and specialized architectures (e.g., CIM, FPGA designs, ASICs, SoCs). This level of detail makes it a valuable resource for hardware architects.

  • Forward-Looking Perspective: Beyond summarizing existing work, the paper articulates clear future challenges and proposes a three-layer technology stack for addressing them, offering actionable directions for future research in software toolchains, hardware architectures, and algorithms.

    In summary, this paper differentiates itself by offering an integrated, evolutionary, and forward-looking analysis of robotic computing systems through the critical paradigm of algorithm-hardware co-design, which is essential for realizing the full potential of embodied AI.

4. Methodology

This paper is a comprehensive survey, and as such, its methodology is primarily its structured approach to analyzing and presenting existing research. The core idea is to demonstrate how algorithm-hardware co-design is crucial for the evolution of robotic computing systems and embodied AI. This is achieved by systematically reviewing algorithms and their corresponding hardware acceleration techniques across different stages of robotic intelligence.

4.1. Principles

The fundamental principle guiding this survey is the belief that achieving an appropriate balance among system metrics like accuracy, latency, and power consumption in robotic computing systems—especially with the rise of embodied AI—requires algorithm-hardware co-design. This methodology involves:

  1. Analyzing Algorithm Computational Behaviors: Identifying common properties within algorithms such as parallelism, locality, sparsity, and similarity.

  2. Algorithm Optimization: Introducing optimizations to algorithms to better leverage these inherent properties, thereby reducing computing and storage complexity and exposing opportunities for hardware support.

  3. Hardware Innovation: Developing specialized hardware from architecture to circuits that is tailored to exploit these algorithmic properties, leading to substantial system-wide benefits.

    The paper argues that by aligning the strengths of both algorithm design and hardware engineering, algorithm-hardware co-design facilitates enhanced performance and efficiency for robotic computing systems.

4.2. Core Methodology In-depth (Layer by Layer)

The paper structures its analysis into several layers, moving from the foundational concepts of traditional robotics to the cutting edge of embodied AI, always emphasizing the algorithm-hardware co-design perspective.

4.2.1. Overall Algorithm-Hardware Co-design Framework

The paper introduces the algorithm-hardware co-design as a central theme, illustrated in Figure 2 of the original paper. The following figure (Figure 2 from the original paper) shows the algorithm-hardware co-design framework:

Figure 2 (schematic): Algorithm-hardware co-design in the evolution of robotic computing systems and embodied AI. The diagram is divided into three parts, robotic algorithms, algorithm optimization, and hardware innovation, and highlights the roles of parallelism, data locality, sparsity, and data similarity in co-design.

This diagram illustrates that robotic algorithms inherently exhibit common computational behaviors such as parallelism, locality, sparsity, and similarity. These properties create opportunities for:

  • Algorithm Optimization: Modifying algorithms to better exploit these properties (e.g., reducing compute and storage complexity, exposing hardware support opportunities).
  • Hardware Innovation: Designing hardware architectures and circuits that are specialized to efficiently execute these optimized algorithms (e.g., accelerators, CIM). The interplay between these two domains leads to substantial system-wide benefits (e.g., improved performance, energy efficiency).
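As a toy illustration of the locality property named above, the following sketch (our own example, not from the paper) restructures a matrix multiply into blocks so that each operand tile is reused from fast local storage, the same principle specialized accelerators apply with on-chip SRAM buffers.

```python
import numpy as np

def tiled_matmul(A, B, tile=32):
    """Blocked matrix multiply: each (tile x tile) block of A and B is
    loaded once and reused many times, improving data locality versus a
    naive triple loop that re-reads operands from slow memory."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # On an accelerator, these slices would live in on-chip SRAM.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(tiled_matmul(A, B), A @ B)
```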

4.2.2. Traditional Robotic System Architecture

The paper first deconstructs a typical robotic system into its functional components, as shown in Figure 3 of the original paper. The following figure (Figure 3 from the original paper) shows the overall architecture of a robotic computing and actuation system:

Figure 3 (schematic): Overall architecture of the robotic computing and actuation system, showing the coordinated pipeline of the sensory system, task planning, motion planning, action mapping, and control modules, forming a closed loop from sensing to actuation.

This architecture comprises:

  • Sensory System: Gathers data from the internal state of the robot and the external environment (e.g., cameras, LiDARs).

  • Computing System: The "brain" that integrates algorithms and hardware. It processes sensory data to make intelligent decisions and translate them into actions. This system is broken down into five functional tasks:

    1. Perception: Processes sensory data to extract high-level information (e.g., objects, positions, maps).
    2. Task Planning: Decides what actions the robot should take based on environmental information.
    3. Motion Planning: Determines how to execute actions by generating collision-free trajectories.
    4. Action Mapping: Translates planned paths/trajectories into physical robot actions using robot mechanism models (kinematics, dynamics).
    5. Control: Sends command sequences or signals to actuators to ensure the robot follows reference actions in a feedback loop.
  • Actuation System: Executes the physical movements of the robot.

    Each of these stages demands specialized algorithms and computing hardware for high accuracy, real-time performance, and energy efficiency.

4.2.3. Embodied AI Evolution Roadmap

The paper then traces the evolution of robotic computing systems and embodied AI through a three-step roadmap, depicted in Figure 1 and Figure 4 of the original paper. The following figure (Figure 4 from the original paper) shows the evolution from traditional robotics to end-to-end embodied AI:

Figure 4 (schematic): The evolution from traditional robotics to end-to-end embodied AI, covering cognitive planning and action execution models and their algorithm-hardware co-design framework.

The roadmap outlines the transition:

  1. Traditional Robotics (Model-Driven): Relies on predefined models and explicit programming for specific tasks.

  2. Hierarchical Model (Cognitive Planning + Action Execution): A transitional stage where embodied AI models are introduced.

    • Cognitive Planning Model: At a higher level, uses LLMs or VLMs for understanding, reasoning, and long-horizon planning. These models are often large and might operate at a lower frequency.
    • Action Execution Model: At a lower level, uses specialized models (e.g., Transformers, diffusion models) for dexterous action control, operating at a higher frequency.
  3. End-to-End Model (Vision-Language-Action - VLA): The ultimate goal, where perception, planning, and action are seamlessly integrated into a single model, enabling general-purpose tasks in real time with high intelligence.

    This roadmap serves as the organizational backbone for discussing embodied AI algorithms and their hardware implications.

4.2.4. Detailed Analysis of Traditional Robotics Algorithms and Hardware (Section 2)

The paper dedicates a significant portion to detailing the algorithms and their hardware accelerators for each of the five functional tasks in traditional robotics.

4.2.4.1. Perception

  • Detection:
    • Algorithms: Two-stage methods (e.g., RCNN series) prioritize accuracy by first generating region proposals and then classifying. Single-stage methods (e.g., YOLO series) prioritize speed by directly predicting categories and bounding boxes.
    • Hardware Co-design: Accelerators optimize multiply-accumulate (MAC) operations within sensors (SoCs) or reduce computational overhead for multi-scale semantic feature extraction (MSFE) through parallel processing and data compression.
  • Segmentation:
    • Algorithms: Semantic segmentation labels regions by class, instance segmentation distinguishes multiple objects of the same class. UNet and DeepLab are widely used, employing upsampling and dilated convolutions, respectively.
    • Hardware Co-design: Accelerators focus on reducing data intensity and pixel-level computations. Examples include low-energy imaging devices with analog background subtraction and compute-in-memory (CIM) architectures for floating-point operations, reducing bandwidth and area.
  • Depth Estimation:
    • Algorithms: Estimates environmental depth from stereo images by calculating disparities using stereo matching algorithms. Local methods (e.g., SAD) trade accuracy for speed; Global methods (e.g., Semi-Global Matching (SGM)) are more accurate but time-consuming.
    • Hardware Co-design: Accelerating stereo matching is critical. SAD can be implemented on FPGAs with efficient line buffering (a minimal SAD sketch follows this list). SGM processors use high-throughput pipelined architectures with dependency-resolving schemes and customized ultra-high bandwidth SRAM to exploit unique data behaviors.
  • Localization:
    • Algorithms: Estimates robot pose (position and orientation). SLAM constructs maps while localizing, often formulated as constrained nonlinear optimization. VIO (and other odometry methods) calculate relative poses, typically using Kalman filtering, without constructing explicit maps.
    • Hardware Co-design: For mobile robots, energy-efficient FPGA accelerators for SLAM backend optimization exploit data sparsity, locality, similarity, and pipeline opportunities. Fully integrated VIO accelerators on a single chip eliminate energy-consuming data transfers by using compression, sparsity, rescheduling, and parallelism. Unified localization algorithm frameworks identify common kernels and co-design front-end and back-end accelerators with task parallel processing, data locality memory schemes, and workload schedulers. Frameworks like Archytas automate accelerator generation from algorithm descriptions.
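To ground the depth-estimation discussion, here is a minimal SAD block-matching sketch (our own simplified illustration; the window size and disparity range are arbitrary assumptions). FPGA implementations pipeline exactly this kind of kernel with line buffers so each pixel is fetched from memory only once.

```python
import numpy as np

def sad_disparity_row(left, right, y, window=5, max_disp=32):
    """Winner-take-all SAD stereo matching for one image row.

    For each pixel, a window is compared against windows in the right
    image shifted by candidate disparities; the disparity with the
    minimum Sum of Absolute Differences wins.
    """
    half = window // 2
    w = left.shape[1]
    disp = np.zeros(w, dtype=np.int32)
    for x in range(half + max_disp, w - half):
        patch_l = left[y-half:y+half+1, x-half:x+half+1].astype(np.int32)
        costs = [np.abs(patch_l - right[y-half:y+half+1,
                                        x-d-half:x-d+half+1].astype(np.int32)).sum()
                 for d in range(max_disp)]
        disp[x] = int(np.argmin(costs))  # winner-take-all disparity
    return disp

rng = np.random.default_rng(0)
left = rng.integers(0, 256, size=(64, 128))
right = np.roll(left, -4, axis=1)  # synthetic horizontal shift of 4 pixels
print(sad_disparity_row(left, right, y=32)[50])  # -> 4
```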

4.2.4.2. Task Planning

  • Algorithms: Determines a sequence of future motions. Initially human-expert driven, then Finite State Machines (FSMs) scheduled motions based on perception (e.g., DARPA Urban Challenge). Partially Observable Markov Decision Process (POMDP) and Hidden Markov Model (HMM) handle uncertainty and dynamic changes.
  • Hardware Co-design: Historically on microcontroller units (MCUs). Increasing intelligence demands higher computing power, leading to Systems-on-Chip (SoCs) (e.g., Tesla's Full Self-Driving (FSD) computer which integrates CPUs, ISPs, GPUs, and custom neural network accelerators).

4.2.4.3. Motion Planning

  • Algorithms: Generates collision-free trajectories.
    • Global Planning: Finds shortest collision-free paths. Graph-search-based methods (e.g., A*) traverse state spaces serially. Sampling-based methods (e.g., PRM, RRT, RRT*) find trajectories by probabilistic sampling, faster for high-dimensional spaces.
    • Local Planning: Refines trajectories to satisfy constraints (safety, physical limits), often formulated as optimization problems. Numerical methods, distributed optimization (e.g., ADMM), and data-driven methods are used.
  • Hardware Co-design:
    • Global Planning: Accelerators exploit similarity patterns in A* (RACOD) to predict future poses. PRM accelerators (Dadu) adopt hardware-friendly graph representations. RRT is optimized with parallel schemes and prune-and-reuse strategies for dynamic environments (a minimal RRT sketch follows this list).
    • Local Planning: GPU-based optimization solvers leverage data-independent and data-dependent parallelism for matrix operations. Factor graph accelerators (BLITZCRANK) exploit sparsity and incremental solving. Hardware generation frameworks (ORIANNA) automatically create customized accelerators.
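For concreteness, the following is a minimal 2D RRT sketch (our own illustration under simplifying assumptions: a point robot, Euclidean nearest-neighbor search, and a user-supplied collision checker `is_free`). The per-candidate nearest-neighbor and collision-check computations are independent, which is exactly the parallelism motion-planning accelerators exploit.

```python
import random, math

def rrt(start, goal, is_free, bounds, step=0.5, iters=5000, goal_tol=0.5):
    """Minimal 2D Rapidly-exploring Random Tree.

    Repeatedly sample a random point, extend the nearest tree node one
    `step` toward it, and keep the new node if it is collision-free.
    """
    (xmin, xmax), (ymin, ymax) = bounds
    nodes, parent = [start], {0: None}
    for _ in range(iters):
        sample = (random.uniform(xmin, xmax), random.uniform(ymin, ymax))
        i = min(range(len(nodes)), key=lambda j: math.dist(nodes[j], sample))
        nx, ny = nodes[i]
        theta = math.atan2(sample[1] - ny, sample[0] - nx)
        new = (nx + step * math.cos(theta), ny + step * math.sin(theta))
        if not is_free(new):
            continue  # discard samples that collide with obstacles
        parent[len(nodes)] = i
        nodes.append(new)
        if math.dist(new, goal) < goal_tol:   # close enough: trace back the path
            path, k = [], len(nodes) - 1
            while k is not None:
                path.append(nodes[k]); k = parent[k]
            return path[::-1]
    return None  # no path found within the iteration budget

# Toy workspace: 10x10 area with a circular obstacle of radius 2 at (5, 5)
path = rrt((0, 0), (9, 9), lambda p: math.dist(p, (5, 5)) > 2.0,
           bounds=((0, 10), (0, 10)))
```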

4.2.4.4. Action Mapping

  • Algorithms: Translates mathematical paths into physical robot actions using robot mechanism models.
    • Kinematics: Describes motion without forces. Forward Kinematics (FK) calculates end-effector poses from joint variables. Inverse Kinematics (IK) is the inverse, often more complex (analytic methods for low DOF, numerical methods like Jacobian inverse for high DOF).
    • Dynamics: Describes relationship between forces/torques and motion/acceleration. Inverse and Forward Dynamics (and their gradients) are key kernels in MPC.
  • Hardware Co-design:
    • IK: FPGA implementations use data type reduction and module sharing. Accelerators (Dadu) use speculation-based Jacobian transpose algorithms and exploit algorithmic parallelism (a Jacobian-transpose sketch follows this list).
    • Dynamics: Robomorphic computing designs accelerators parameterized by robot morphology, exploiting parallelism and matrix sparsity. RoboShape uses topology patterns for scalable accelerators. Dadu-rbd provides multifunctional pipelines to optimize data locality and adapt to robot-specific sparsity.
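A minimal sketch of the Jacobian-transpose IK iteration for a planar 2-link arm follows (our own illustrative example; the link lengths, step size, and tolerance are assumptions). Accelerators such as Dadu build speculation-based variants on top of this basic update.

```python
import numpy as np

def ik_jacobian_transpose(theta, target, l1=1.0, l2=1.0, alpha=0.1, iters=200):
    """Iteratively update joint angles of a planar 2-link arm using
    delta_theta = alpha * J^T * error. The transpose avoids the costly
    (and possibly singular) Jacobian inversion used by other numerical
    IK methods, at the price of slower convergence."""
    theta = np.asarray(theta, dtype=float)
    for _ in range(iters):
        t1, t12 = theta[0], theta[0] + theta[1]
        # Forward kinematics: end-effector position
        pos = np.array([l1*np.cos(t1) + l2*np.cos(t12),
                        l1*np.sin(t1) + l2*np.sin(t12)])
        err = target - pos
        if np.linalg.norm(err) < 1e-4:
            break
        # Analytic Jacobian of the 2-link forward kinematics
        J = np.array([[-l1*np.sin(t1) - l2*np.sin(t12), -l2*np.sin(t12)],
                      [ l1*np.cos(t1) + l2*np.cos(t12),  l2*np.cos(t12)]])
        theta += alpha * (J.T @ err)   # Jacobian-transpose update step
    return theta

print(ik_jacobian_transpose([0.3, 0.3], np.array([1.2, 0.8])))
```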

4.2.4.5. Control

  • Algorithms: Ensures robots follow reference actions with feedback.
    • Without Prediction: PID control (simple, efficient).
    • With Prediction: MPC (predicts future behavior, optimizes control commands).
    • Multi-task Control: Whole-body control for redundant DOFs (closed-form methods using pseudo-inverse matrices, or optimization methods like quadratic programming).
  • Hardware Co-design: For high control rates (>1 kHz), hardware acceleration is necessary. Numerous works accelerate PID (a minimal PID loop is sketched below). MPC accelerators on FPGAs (for quadratic programming solvers) and ASICs (leveraging parallel and sparse natures with pruning strategies and physical model transformation) have been developed.
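For reference, a minimal discrete PID loop looks as follows (our own sketch with made-up gains and a toy first-order plant). At control rates above 1 kHz the loop body must finish in well under a millisecond, which is why even this simple computation is frequently moved into dedicated hardware.

```python
class PID:
    """Discrete PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, setpoint, measurement, dt):
        err = setpoint - measurement
        self.integral += err * dt                  # accumulate integral term
        deriv = (err - self.prev_err) / dt         # finite-difference derivative
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

# Drive a toy first-order plant toward a unit setpoint at a 1 kHz loop rate
pid, state, dt = PID(2.0, 0.5, 0.05), 0.0, 1e-3
for _ in range(2000):
    u = pid.step(1.0, state, dt)
    state += dt * (u - state)  # toy plant dynamics
print(round(state, 3))  # converges to ~1.0
```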

4.2.5. Detailed Analysis of Embodied AI Algorithms and Hardware (Section 3)

The paper then transitions to the embodied AI era, detailing the algorithms and hardware acceleration for hierarchical and end-to-end models.

4.2.5.1. Hierarchical Model

  • Cognitive Planning Model:

    • Algorithms: LLMs (e.g., ChatGPT for Robotics, Inner Monologue) for long-horizon task decomposition and reasoning from textual descriptions. VLMs (e.g., PaLM-E, Figure 5) integrate visual and language inputs for multimodal reasoning.

    • Hardware Co-design: These models are large (billions of parameters), requiring efficient execution. The paper implicitly notes that Transformer accelerators (discussed later) are crucial here. The following figure (Figure 5 from the original paper) shows the PaLM-E model architecture overview:

      Figure 5 (schematic): The PaLM-E pipeline, in which visual embeddings from a vision Transformer (ViT) and language inputs flow into the PaLM LLM for task question answering, illustrating the information flow in algorithm-hardware co-design.

  • Action Execution Model:

    • Algorithms: Executes low-level actions.

      • Pose Prediction: Focuses on finding optimal final poses, often integrates 3D reconstruction (point cloud, NeRF, 3D GS). Examples include ASGrasp (using GSNet), AnyGrasp, GaussianGrasper, YOSO (using Point Transformer).
      • Action Generation: Directly outputs actions or action sequences. Autoregression-based models use Transformers (e.g., RT-1, ALOHA, Figure 6) for sequence modeling. Diffusion-based models (e.g., Diffusion Policy, Figure 7, UniP, AvDC, RDT-1B) represent action sequence generation as a conditional denoising diffusion process (a minimal denoising sketch follows Fig. 7 below).
    • Hardware Co-design: Requires faster inference. 3D reconstruction accelerators and diffusion model accelerators (discussed later) are critical. The following figure (Figure 6 from the original paper) shows the Action Chunking with Transformers (ACT) model architecture:

      Fig. 6. (Color online) Model architecture of Action Chunking with Transformers (ACT) (from ALOHA). The left side shows the data inputs, including images from different cameras and feature extraction; the right side shows the Transformer decoder processing action sequences and position embeddings.

    The following figure (Figure 7 from the original paper) illustrates the general formulation of the diffusion policy:

    Figure 7 (schematic): General formulation of the diffusion policy and its CNN-based and Transformer-based implementations, involving observation inputs, action sequences, and conditioning embedding modules; the formula $a \cdot x + b$ in the figure illustrates the conditioned convolution operation.
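To make the denoising formulation concrete, the following sketch (our own illustration; `eps_model` is a placeholder standing in for the trained noise-prediction network, and the noise schedule values are arbitrary) runs a DDPM-style reverse process that turns Gaussian noise into an action sequence. The loop over $T$ steps is the source of the linear growth in computation that diffusion-model accelerators target.

```python
import numpy as np

def denoise_actions(eps_model, obs, horizon=16, act_dim=7, T=50, seed=0):
    """DDPM-style reverse process sketch: start from Gaussian noise and
    iteratively denoise it into an action sequence conditioned on the
    observation. eps_model(a_t, t, obs) stands in for the trained
    noise-prediction network (CNN- or Transformer-based in Diffusion Policy).
    """
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    a = rng.normal(size=(horizon, act_dim))     # a_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = eps_model(a, t, obs)              # predicted noise at step t
        mean = (a - betas[t] / np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.normal(size=a.shape) if t > 0 else 0.0
        a = mean + np.sqrt(betas[t]) * noise    # sample a_{t-1}
    return a                                    # denoised action sequence a_0

# Dummy noise predictor standing in for the trained network
actions = denoise_actions(lambda a, t, obs: 0.1 * a, obs=None)
print(actions.shape)  # (16, 7)
```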

  • Hierarchical Frameworks:

    • Algorithms: Combine cognitive planning and action execution models (e.g., SkillDiffuser, $\pi_0$, GO-1, GR00T N1 (Figure 8), Helix). These frameworks aim to leverage the strengths of both, handling complex tasks with multimodal perception, long-horizon planning, and dexterous actions.

    • Hardware Co-design: Helix demonstrates hardware-aware training to mitigate temporal misalignment between high-frequency and low-frequency components. This highlights the ongoing need for dedicated co-designed accelerators. The following figure (Figure 8 from the original paper) shows the GR00T N1 model architecture overview:

      Fig. 8. (Color online) GR00T N1 model architecture overview (from GR00T N1[14]). The model combines image observations, language instructions, and robot state: images and text are encoded, and a diffusion Transformer generates action tokens that ultimately control the robot's actions.

4.2.5.2. End-to-End Model

  • Algorithms: Aims to integrate perception, planning, and action into a single model. Vision-Language-Action (VLA) models (e.g., RT-2, RT-X, OpenVLA (Figure 9)) process multimodal inputs to generate robotic actions, showing capabilities in long-horizon planning and reasoning.

  • Hardware Co-design: These are highly resource-intensive due to their integrated nature and often large parameter counts. The following figure (Figure 9 from the original paper) shows the OpenVLA model architecture overview:

    Fig. 9. (Color online) OpenVLA model architecture overview (from OpenVLA[17]). An input image and a language instruction describing the robot task (e.g., "put the eggplant in the bowl") flow through DinoV2 and SigLIP visual encoders and an MLP projector into the Llama 2 7B backbone with an action de-tokenizer, producing a 7D action representation of position, rotation, and gripper parameters ($\Delta x, \Delta \theta, \Delta \mathrm{Grip}$).

4.2.5.3. Embodied AI Hardware

The paper identifies Transformers, 3D reconstruction, and diffusion models as primary computational bottlenecks across diverse embodied AI algorithms, driving the need for specialized accelerators.

  • Transformer Accelerator:
    • Challenges: Self-attention mechanism has quadratic computational and memory complexity, especially for LLMs and high-resolution images.
    • Hardware Co-design:
      • Digital Accelerators: Exploit sparsity patterns and data redundancy (Wang et al., Tambe et al. with entropy-based early exit, Kim et al. with C-Transformer and implicit weight generation) to reduce computation, memory access, and power.
      • CIM Accelerators: Integrate computation directly within memory. Examples include Tu et al. (reconfigurable streaming network, bitline-transpose CIM, sparse attention scheduler) and Guo et al. (hybrid analog-digital lightning-like CIM with compressed adder tree and analog-storage quantizer).
  • 3D Reconstruction Accelerator:
    • Challenges: Intensive spatial computation and irregular memory access.
    • Hardware Co-design: Uses approximate computing and dataflow transformation.
      • Point Cloud Accelerators: Address irregular/sparse data. Examples include Im et al. (ToF sensor, window-based technique for conjugate gradient, feature reuse), Sun et al. (reconfigurable sparse convolution core, neighbor search circuit), Jung et al. (PRNG for neighbor search).
      • NeRF Reconstruction Accelerators: Address intensive MLP calculations and memory access. Examples include Han et al. (MetaVRain with spatial, temporal, top-down attention, periodic polynomial for positional encoding), Ryu et al. (NeuGPU with on-chip hash tables, hybrid interpolation, similarity sparsity), Park et al. (Space-Mate for NeRF-SLAM with out-of-order SMoE router and heterogeneous core).
      • 3D GS Reconstruction Accelerators: Address sorting and rasterization bottlenecks. Examples include Lee et al. (GSCore with Gaussian shape-aware intersection test, two-stage hierarchical sorting), Wu et al. (GauSPU for 3D GS and SLAM using sparse-tile-sampling).
  • Diffusion Model Accelerator:
    • Challenges: Iterative denoising process leads to linear increase in computational complexity and memory access.
    • Hardware Co-design: Focuses on optimizing dataflow across adjacent iterations using differential computation (a toy sketch follows this list).
      • Digital Accelerators: Kong et al. (Cambricon-D) optimizes for nonlinear operators by using sign-mask dataflow and outlier-aware processing elements.

      • CIM Accelerators: Guo et al. (divides input variations into dense integer and sparse floating-point. Uses radix-8 Booth CIM for integer, reconfigurable 4-operand exponent CIM (4Op-ECIM) for floating-point, also configurable as CAM for sparse value skipping).
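A toy sketch of the differential-computation idea (our own illustration, not Cambricon-D's or Guo et al.'s actual dataflow): because inputs at adjacent denoising iterations are highly similar, a linear layer's output can be updated from the sparse input delta rather than recomputed in full.

```python
import numpy as np

def differential_linear(W, x_prev, y_prev, x_curr, thresh=1e-3):
    """Update y = W @ x incrementally across two adjacent denoising steps.

    Only the columns of W matching significantly changed inputs are
    touched, so compute and memory traffic scale with the sparse delta
    rather than the full input, mirroring the inter-step similarity that
    diffusion-model accelerators exploit.
    """
    delta = x_curr - x_prev
    active = np.abs(delta) > thresh              # the few "outlier" inputs that changed
    y_curr = y_prev + W[:, active] @ delta[active]
    return y_curr, active.mean()

rng = np.random.default_rng(1)
W, x0 = rng.normal(size=(64, 256)), rng.normal(size=256)
x1 = x0.copy()
idx = rng.choice(256, size=12, replace=False)
x1[idx] += 0.1 + rng.random(12)                  # a handful of inputs change between steps
y1, frac = differential_linear(W, x0, W @ x0, x1)
print(np.allclose(y1, W @ x1), f"{frac:.1%} of inputs recomputed")
```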

        This layered and detailed breakdown, consistently linking algorithms to their hardware acceleration strategies and underlying computational properties, forms the core methodology of the survey.

5. Experimental Setup

As a survey paper, "Robotic computing system and embodied AI evolution: an algorithm-hardware co-design perspective" does not present its own experimental setup, datasets, evaluation metrics, or baselines. Instead, it synthesizes and analyzes the experimental results and methodologies reported in hundreds of other research papers.

The paper's "experimental setup" is, in essence, its comprehensive literature review and structured analysis framework. It draws upon the diverse experimental methodologies from the referenced works to illustrate trends, challenges, and successful algorithm-hardware co-design implementations.

5.1. Datasets

The paper does not introduce or use its own datasets. It discusses various datasets used by the referenced papers, which are integral to understanding the context of the algorithms and hardware being reviewed. For instance:

  • RT-X (a referenced VLA model) builds upon the Open X-Embodiment dataset, a collaborative dataset aggregating data from 22 distinct robotic platforms across 21 institutions, encompassing 527 skills and 160,266 tasks. This dataset aims to enhance generalization across various robotic platforms and scenarios.

  • ALOHA (an Autoregression-based action generation model) learns from expert demonstrations of robot actions collected through teleoperation.

  • PaLM-E (a VLM) is trained end-to-end on multiple robotic tasks, leveraging VLM pretraining with fine-tuning on robotic data.

  • RDT-1B (a diffusion foundation model) is pretrained on a large-scale multi-robot dataset and finetuned on a self-created multi-task bimanual dataset.

  • Helix (a hierarchical framework) is trained on substantial teleoperation data (~500 h).

    The choice of these diverse datasets by the original researchers is effective for validating the performance of their methods across different robotic tasks, modalities (visual, language, proprioceptive), and generalization capabilities.

5.2. Evaluation Metrics

The paper does not define or use its own evaluation metrics. It refers to the performance metrics discussed or reported in the cited works to highlight the effectiveness of algorithm-hardware co-design. Common metrics implicitly referenced or mentioned in the context of acceleration include:

  • Accuracy: Refers to the correctness of task execution or prediction (e.g., detection accuracy, segmentation accuracy, grasping success rate).

  • Latency: The delay between an input and a corresponding output (e.g., inference time, control loop frequency in Hz or fps, convergence time for planning algorithms). Lower latency is generally better, especially for real-time robotic control.

  • Power Consumption: The amount of electrical power consumed by the hardware (e.g., in watts, joules/token, mW). Lower power consumption is critical for mobile, battery-operated robots.

  • Energy Efficiency: The ratio of computational performance to power consumption (e.g., TFLOPS/W, tera floating-point operations per second per watt; µJ/token, microjoules per token). Higher energy efficiency is better.

  • Throughput: The amount of work done in a unit of time (e.g., fps for vision tasks, tasks/second).

  • Resource Utilization: How efficiently hardware resources are used.

    While the paper doesn't provide the mathematical formulas for these general metrics, their conceptual definitions are standard in computer science and engineering. For example:

  • Accuracy (General Definition): The degree to which the result of a measurement, calculation, or specification conforms to the correct or true value. In classification tasks, it is often defined as the ratio of correctly predicted instances to the total number of instances. $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $

  • Latency (General Definition): The time delay between the cause and effect of some physical change in the system being observed. In computing, it is often the time taken for a system to respond to an input.

    • No single universal formula; it's a direct time measurement.
  • Power Consumption (General Definition): The rate at which electrical energy is transferred, used, or dissipated.

    • No single universal formula; it's a direct measurement, often $P = VI$ (voltage × current) for electrical power.
  • Energy Efficiency (TFLOPS/W): A common metric for computational accelerators, representing the number of tera floating-point operations per second (TFLOPS) performed per watt of power consumed. $ \mathrm{Energy\ Efficiency\ (TFLOPS/W)} = \frac{\text{TFLOPS}}{\text{Power Consumption (W)}} $

    • TFLOPS: Tera Floating-point Operations Per Second (operations performed per second).
    • Power Consumption (W): Power consumed by the hardware in Watts.
  • Energy Efficiency (µJ/token): A metric specific to LLM or Transformer inference, representing the energy consumed per processed token (see the conversion snippet after this list). $ \mathrm{Energy\ Efficiency\ (\mu J/token)} = \frac{\text{Total Energy (µJ)}}{\text{Number of Tokens Processed}} $

    • Total Energy (µJ): Total energy consumed in microjoules.
    • Number of Tokens Processed: The total count of input/output tokens processed by the model.
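For illustration, both efficiency metrics are simple conversions from throughput and power (the numbers below are made up, not from the paper or the surveyed works).

```python
def tflops_per_watt(ops_per_second, power_watts):
    """Energy efficiency in TFLOPS/W: throughput divided by power."""
    return (ops_per_second / 1e12) / power_watts

def microjoules_per_token(power_watts, tokens_per_second):
    """Energy per token in µJ: power (J/s) divided by token throughput."""
    return power_watts / tokens_per_second * 1e6

# Hypothetical accelerator: 30e12 FLOP/s at 2 W, decoding 10,000 tokens/s
print(tflops_per_watt(30e12, 2.0))          # 15.0 TFLOPS/W
print(microjoules_per_token(2.0, 10_000))   # 200.0 µJ/token
```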

5.3. Baselines

The paper does not define its own baselines. However, it implicitly refers to baselines used in the surveyed literature. These generally include:

  • General-purpose hardware: CPUs and GPUs are often used as baselines to demonstrate the performance-efficiency gap that specialized accelerators aim to address. For instance, RRT* on CPUs is compared to GPU implementations, and GPU power consumption is contrasted with specialized ASICs.

  • Previous algorithmic approaches: When new algorithms are discussed (e.g., Transformer-based RT-1 or diffusion-based Diffusion Policy), they are implicitly compared to earlier, less intelligent or less efficient methods (e.g., PID control, classical motion planning).

  • Non-accelerated versions: For hardware acceleration works, the non-accelerated software implementation of the same algorithm on a standard platform (e.g., FPGA vs. CPU implementation of SLAM backend optimization) serves as a baseline.

  • Alternative acceleration techniques: Different hardware acceleration strategies (e.g., digital accelerators vs. CIM-based accelerators for Transformers) are compared against each other.

  • Simpler or smaller models: For complex embodied AI models, simpler or smaller models, or those without specific co-design optimizations, serve as baselines to demonstrate the improvements from algorithmic advancements or algorithm-hardware co-design. For example, GR00T N1 and $\pi_0$ are stated to outperform "baseline models" in various tasks.

    The representativeness of these baselines is determined by the original authors of the referenced papers, who typically choose the most relevant and state-of-the-art comparisons for their specific contributions.

6. Results & Analysis

As a comprehensive survey paper, this work does not present original experimental results in the form of new data from its own experiments. Instead, its "results" are the synthesis and analysis of findings from the vast body of literature it reviews. The paper's core contribution in this section is to highlight the consistent benefits demonstrated by algorithm-hardware co-design across various robotic domains and embodied AI applications.

6.1. Core Results Analysis

The paper systematically analyzes the effectiveness of algorithm-hardware co-design by summarizing the performance and efficiency gains achieved by various specialized hardware accelerators for traditional robotics and embodied AI algorithms.

  • Addressing the Performance-Efficiency Gap: The paper consistently shows that general-purpose hardware (CPUs, GPUs) often fails to meet the stringent real-time and energy-efficiency requirements of robotics. For example, RRT* takes several seconds on CPUs, making it impractical, while GPUs consume hundreds of watts, precluding their use in power-constrained mobile robots. This gap motivates the co-design approach.

  • Traditional Robotics - Specific Gains:

    • Perception:
      • Detection: Convolutional imaging SoCs can achieve 0.2-to-3.6 TOPS/W (Tera Operations Per Second per Watt) for feature extraction.
      • Segmentation: Low-energy imagers with in-sensor event detection achieve 2.9 pJ/pixel·frame (picojoules per pixel per frame). CIM-based architectures reduce bandwidth and area.
      • Depth Estimation: A 1920×1080, 30 fps, 2.3 TOPS/W stereo-depth processor for SGM demonstrates high performance and energy efficiency.
      • Localization: A 2-mW fully integrated VIO accelerator (Navion) eliminates massive energy-consuming data transfer. FPGA accelerators leverage data sparsity, locality, similarity, and pipeline opportunities for energy efficiency.
    • Task Planning: Tesla's FSD computer, an SoC with custom neural network accelerators, achieves impressive performance at low power consumption.
    • Motion Planning: RACOD (for A*) predicts future poses and performs collision checks to improve speed. Dadu (for PRM) adopts hardware-friendly graph representation for scalability. A 1.5-µJ/task path-planning processor for RRT demonstrates energy efficiency in micro-robots.
    • Action Mapping: Accelerators for IK (e.g., Dadu) exploit algorithmic parallelism for real-time performance and high energy efficiency in high-DOF applications. Robomorphic computing and RoboShape leverage robot morphology and topology for scalable and flexible accelerators.
    • Control: FPGA accelerators for MPC improve quadratic programming solver speed. A 28 nm, 142 mW motion-control SoC achieves high performance for autonomous mobile robots.
  • Embodied AI - Specific Gains:

    • Transformer Accelerators:
      • Digital accelerators (Wang et al.) achieve 27.5 TOPS/W with approximate computing and sparsity speculation. Tambe et al.'s processor achieves 18.1 TFLOPS/W with entropy-based early exit and mixed-precision. C-Transformer (Kim et al.) achieves 2.6–18.1 µJ/token for LLMs on mobile devices.
      • CIM-based accelerators (Tu et al.) achieve 15.59 µJ/token for sparse Transformers by minimizing external memory accesses and integrating transposition. MulTCIM (Tu et al.) reaches 2.24 µJ/token for multimodal Transformers. A lightning-like hybrid CIM macro achieves high energy efficiency for Transformers and CNNs.
    • 3D Reconstruction Accelerators:
      • Point Cloud: DSPU reduces power by more than 60% for depth processing.
      • NeRF: MetaVRain achieves 133mW real-time NeRF processing, reducing computation by over 95%. NeuGPU achieves 345s modeling time (faster than edge GPUs). Space-Mate for NeRF-SLAM consumes 303.5mW.
      • 3D GS: GSCore achieves 91.2 fps (Jetson Xavier GPU only 6.4 fps). GauSPU achieves 63.9x improvement in energy efficiency compared to RTX 3090 GPU.
    • Diffusion Model Accelerators:
      • Cambricon-D reduces memory access by more than 66% for diffusion models.
      • A CIM-based diffusion model accelerator achieves 74.34 TFLOPS/W.
  • Hierarchical and End-to-End Models: Frameworks like Helix, GR00T N1, $\pi_0$, and GO-1 demonstrate impressive zero-shot generalization, instruction following, few-shot learning, and complex task handling, often significantly outperforming baselines in success rate (e.g., RDT-1B shows an average success rate improvement of 56%). OpenVLA also shows improved speed and success with efficient fine-tuning.

    In summary, the paper's analysis of existing results strongly validates that algorithm-hardware co-design is not merely an optimization but a necessity for robotic computing systems, particularly as embodied AI models grow in complexity and demand real-time operation under tight power budgets. The diverse examples showcase how tailoring hardware to algorithmic characteristics (sparsity, parallelism, locality) and vice-versa yields substantial improvements across all key performance metrics.

6.2. Data Presentation (Tables)

The original paper is a survey and does not contain any tables presenting its own experimental results. All data points are extracted from the text of the referenced papers and discussed qualitatively or quantitatively within the narrative.

6.3. Ablation Studies / Parameter Analysis

As a survey paper, the authors did not conduct their own ablation studies or parameter analysis. However, they discuss the impact of certain design choices or optimizations that are akin to findings from ablation studies presented in the referenced papers. For example:

  • In the context of Transformer accelerators, Wang et al. investigated sparsity patterns and data redundancy to develop an approximate processing element, implying that their effectiveness was verified by comparing against non-optimized counterparts.

  • For NeRF reconstruction, Han et al.'s MetaVRain system uses spatial attention, temporal familiarity, and top-down attention stages, each contributing to computation reduction, which would typically be verified through ablation studies in their original work.

  • The Helix robot's implementation of hardware-aware training to mitigate temporal offset is an example of an optimization whose effectiveness would have been verified through comparison with a setup lacking this co-design approach.

    The paper synthesizes these findings to support its overarching argument for algorithm-hardware co-design, but it does not perform these analyses itself.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides a comprehensive overview of robotic computing systems through the lens of algorithm-hardware co-design, tracing the evolution from traditional robotics to hierarchical and end-to-end embodied AI models. The paper underscores that despite significant advancements in algorithms, achieving a balance among accuracy, latency, and power consumption remains a critical challenge, especially when relying on general-purpose hardware. The authors conclude that algorithm-hardware co-design is the primary methodology to address this performance-efficiency gap by leveraging inherent algorithmic properties like parallelism, locality, sparsity, and similarity for both algorithm optimization and hardware innovation. The survey details numerous examples of co-designed solutions across perception, planning, action mapping, control, and embodied AI components (Transformers, 3D reconstruction, diffusion models), demonstrating substantial system-wide benefits.

7.2. Limitations & Future Work

The paper explicitly identifies new challenges and corresponding research opportunities for realizing smarter and faster robotic computing systems in the era of embodied AI. These challenges stem from the rapid evolution of embodied AI algorithms and from advancements in emerging hardware technologies.

Two main challenges are highlighted, leading to a proposed three-layer technology stack for future exploration (as illustrated in Figure 10 of the original paper):

The following figure (Figure 10 from the original paper) is a schematic of the co-design flow between robotic computing systems, embodied AI algorithms, and hardware, emphasizing the interplay between top-down algorithm adaptation and bottom-up exploitation of hardware features.

  1. Adapting Computing Platforms to Rapidly Evolving Embodied AI Algorithms:

    • Challenge: The fast-changing algorithm architectures and dynamic configurations of sensors and actuators demand flexible computing platforms.
    • Opportunities (Top-Down - Software Toolchain & Hardware Architecture):
      • Software Toolchain:
        • Compilers: Need modular architectures to easily update new model structures and operators.
        • Operating Systems: Must provide generic software interfaces and autonomously manage data synchronization for sensors and actuators, simplifying algorithm transfer across configurations.
      • Hardware Architecture:
        • Reconfigurable Hardware: Must dynamically adjust settings (dataflow, precision) at runtime to meet varying performance demands (e.g., low latency for control, high energy efficiency for planning).
        • Heterogeneous Integration: Should seamlessly incorporate high-performance hardware tailored to different algorithms, enabling agile deployment for diverse applications.
  2. Transforming the Potential of Emerging Hardware Innovations into End-to-End Inference Improvements:

    • Challenge: Emerging technologies like 3D IC, Chiplet, and CIM introduce complex features and expand the design space, making performance evaluation and effective utilization difficult.
    • Opportunities (Bottom-Up - Hardware Toolchain & Algorithms):
      • Hardware Toolchain:
        • Simulation Tools: Need precise models for new hardware characteristics and efficient simulation capabilities for accurate and quick deployment evaluation.
        • Electronic Design Automation (EDA) Tools: Must explore the enlarged design space afforded by emerging technologies to fully exploit their benefits for next-generation embodied AI platforms.
      • Algorithms:
        • Embodied AI Algorithms: Should be developed to directly exploit unique computation features of cutting-edge hardware. For example, the ultra-high bank-wise bandwidth of 3D ICs could motivate new training algorithms that partition gradient computing and parameter updating at bank granularity with minimal cross-bank communication (see the sketch after this list).

          The paper emphasizes that embracing cross-stack co-design, in which algorithm development directly considers hardware features and vice versa, will be significant for the development of next-generation robotic computing systems.
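
As a toy illustration of the bank-granularity training idea above, here is a minimal sketch assuming a hypothetical memory with `NUM_BANKS` independent banks; the partitioning scheme and all names are illustrative, not taken from the paper.

```python
import numpy as np

NUM_BANKS = 8  # hypothetical number of independently addressable banks

def bankwise_sgd_step(params, grads, lr=1e-2):
    """Shard parameters and gradients across banks and update each
    shard locally, so the update itself needs no cross-bank traffic."""
    param_shards = np.array_split(params, NUM_BANKS)  # one shard per bank
    grad_shards = np.array_split(grads, NUM_BANKS)
    updated = [p - lr * g for p, g in zip(param_shards, grad_shards)]
    return np.concatenate(updated)
```

In a real system, gradient computation would also be localized so that each bank mostly reads and writes its own shard, turning the per-bank bandwidth of 3D-stacked memory into end-to-end training throughput.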

7.3. Personal Insights & Critique

This survey is an incredibly valuable resource for anyone working in or studying robotics, AI, and computer architecture. Its strength lies in its integrated perspective, which is often missing in more siloed research.

Personal Insights:

  • Holistic View is Essential: The paper strongly reinforces that the era of designing algorithms or hardware in isolation is over for advanced robotics. Algorithm-hardware co-design is not a luxury but a fundamental necessity to overcome physical limitations and achieve practical embodied AI. This holistic perspective is crucial for students and researchers entering the field.
  • Roadmap Clarity: The three-step roadmap from traditional robotics to hierarchical and end-to-end embodied AI provides an excellent mental model for understanding the progression of the field. It helps contextualize why certain algorithms and hardware challenges emerge at different stages.
  • Sparsity, Locality, Parallelism, Similarity as Guiding Principles: The identification of these four common algorithmic properties as drivers for co-design is a powerful generalization. It offers a framework for analyzing any new algorithm for co-design opportunities (a toy sparsity example follows this list).
  • Emphasis on Emerging Hardware: The detailed discussion of CIM, 3D IC, and Chiplet technologies and their specific implications for embodied AI is forward-looking and highly relevant, providing concrete directions for hardware architects.
  • The "Why" Behind the "What": The paper consistently explains why certain hardware solutions are developed (e.g., CIM for Von Neumann bottleneck, FPGA for flexibility), which is very helpful for a beginner to grasp the motivation behind complex technical choices.
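
To show how one of these properties maps directly onto compute savings, here is a toy Python demonstration of sparsity exploitation in a matrix-vector product; the 90% sparsity level is an arbitrary assumption for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 1024))
x = rng.standard_normal(1024)
x[rng.random(1024) < 0.9] = 0.0        # force ~90% of activations to zero

nz = np.nonzero(x)[0]                  # indices a sparse PE would keep
y_sparse = W[:, nz] @ x[nz]            # only ~10% of the MACs execute
y_dense = W @ x                        # dense reference for checking
assert np.allclose(y_sparse, y_dense)
print(f"MACs executed: {nz.size / x.size:.0%} of dense")
```

A sparsity-aware accelerator realizes the same saving in hardware by skipping zero operands at the processing-element level rather than through software indexing.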

Critique:

  • Density for Beginners: While "beginner-friendly" is a goal, the sheer volume of specific algorithm and hardware examples, especially in Sections 2 and 3, could still be overwhelming for a true novice. A more targeted introductory section, perhaps highlighting the most representative example for each category, followed by deeper dives, might improve accessibility.
  • Lack of Quantitative Synthesis: While the paper mentions performance gains (e.g., "63.9x improvement in energy efficiency"), it doesn't aggregate these into comparative tables or unified benchmarks across different approaches. As a survey, this might be outside its scope, but it would have provided a clearer quantitative landscape of the state-of-the-art for the reader.
  • Abstract Missing: The absence of an abstract in the provided text is a notable omission, as it usually serves as the primary entry point for readers to quickly grasp the paper's essence.
  • Formula Omission (Expected for a Survey, but Still a Point): As noted in the methodology, the paper does not present its own mathematical formulas. While this is typical for a survey, for certain foundational concepts such as self-attention (which is central to Transformers), proactively including the formula in a prerequisite section would further enhance beginner comprehension, even if the paper itself only references it. (I have addressed this in my Prerequisite Knowledge section; for convenience, the standard formulation is reproduced after this list.)
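
For reference, the standard scaled dot-product self-attention at the core of every Transformer-based model discussed above is:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
$$

where $Q$, $K$, and $V$ are the query, key, and value matrices projected from the input and $d_k$ is the key dimension. Many of the accelerator optimizations surveyed earlier, such as sparsity speculation and transposition-friendly CIM layouts, target precisely the $QK^{\top}$ and softmax stages of this computation.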

Transferability and Future Value: The methods and conclusions of this paper are highly transferable. The algorithm-hardware co-design paradigm is applicable to almost any compute-intensive domain beyond robotics, such as edge AI, high-performance computing, and IoT devices, where power, latency, and throughput are critical constraints. The identified challenges in software toolchains, hardware architectures, and algorithm-level innovations for adapting to evolving AI models and hardware technologies are universally relevant to the future of computing. This paper serves as a foundational reference for researchers aiming to build efficient, intelligent, and autonomous systems in the coming decades.
