Robotic computing system and embodied AI evolution: an algorithm-hardware co-design perspective
TL;DR Summary
This study examines the evolution of robotic computing systems and embodied AI, proposing an algorithm-hardware co-design perspective to address challenges in real-time performance and energy efficiency, highlighting limitations of existing hardware in meeting advanced motion pla
Abstract
Abstract information is missing from the provided PDF first-page text.
1. Bibliographic Information
1.1. Title
Robotic computing system and embodied AI evolution: an algorithm-hardware co-design perspective
1.2. Authors
The authors of this paper are:
- Longke Yan
- Xin Zhao
- Bohan Yang
- Yongkun Wu
- Guangnan Dai
- Jiancong Li
- Chi-Ying Tsui
- Kwang-Ting Cheng
- Yihan Zhang
- Fengbin Tu
Most authors are affiliated with the AI Chip Center for Emerging Smart Systems (ACCESS), Hong Kong, China. Longke Yan is pursuing a Ph.D. at the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China, with research interests including computer architecture, VLSI design, embodied AI, and algorithm-hardware co-design. Fengbin Tu, an Assistant Professor at the Department of Electronic and Computer Engineering and Associate Director of the Institute of Integrated Circuits and Systems at The Hong Kong University of Science and Technology, has extensive experience in AI chips, computing-in-memory, computer architecture, and reconfigurable computing.
1.3. Journal/Conference
This paper is published in the Journal of Semiconductors. This journal is a peer-reviewed publication focusing on the latest advancements in semiconductor science and technology, indicating that the paper has undergone a review process by experts in the field. Its publication in this journal suggests its relevance and contribution to the fields of semiconductor technology, computing systems, and AI hardware.
1.4. Publication Year
2025
1.5. Abstract
The abstract information is missing from the provided PDF first-page text. However, based on the paper's content, the abstract would likely introduce the rapid evolution of robotics and embodied AI, highlighting the increasing demands on computing systems. It would then propose algorithm-hardware co-design as a critical methodology to balance performance metrics like accuracy, latency, and power consumption. The paper would outline a roadmap from traditional robotics to hierarchical and end-to-end embodied AI models, discussing associated algorithmic advancements and specialized hardware accelerators. Finally, it would likely conclude by identifying future challenges and research opportunities for next-generation robotic computing systems through cross-stack co-design.
1.6. Original Source Link
/files/papers/69293dc1ba903910b6a9733b/paper.pdf (This appears to be an internal or relative link rather than a full external URL. According to the DOI given in the paper, the official online version is at https://doi.org/10.1088/1674-4926/25020034.) The paper is officially published.
2. Executive Summary
2.1. Background & Motivation
The field of robotics has made significant progress, enabling machines to perform complex tasks autonomously or collaboratively. A typical robotic system consists of a physical body and an intelligent agent (brain) comprising sensory, computing, and actuation subsystems. The computing system, integrating intelligent algorithms and supporting hardware, is pivotal for functionality and performance.
Recent advancements in robotic algorithms, particularly in computer vision, motion planning, and control (e.g., MPC), have dramatically improved intelligence and accuracy. However, these advanced algorithms, especially with the rise of embodied AI (where robots leverage data-driven AI models like Transformers and diffusion models), place immense computational demands on hardware. General-purpose computing hardware like CPUs and GPUs often struggles to meet the stringent real-time, energy-efficiency, and power requirements of robotic systems, leading to a significant performance-efficiency gap. For example, motion planning on CPUs can take seconds, while GPUs consume hundreds of watts, making them impractical for mobile robots or drones. This gap is exacerbated by large-scale embodied AI models with billions of parameters.
The core problem the paper aims to solve is this performance-efficiency gap in robotic computing systems, especially for embodied AI, due to the mismatch between algorithm demands and general-purpose hardware capabilities. This problem is crucial for enabling the widespread deployment of intelligent, autonomous robots in various constrained environments (e.g., mobile robots, drones, manufacturing floors). The paper's entry point is algorithm-hardware co-design, a methodology that analyzes algorithm computational behaviors on hardware and optimizes them through both algorithm refinement and specialized hardware innovation to achieve substantial system-wide benefits.
2.2. Main Contributions / Findings
As a comprehensive survey paper, its main contributions are:
- Comprehensive Overview: It provides a systematic and detailed overview of robotic computing systems from both algorithmic and hardware perspectives, covering traditional robotics and the emerging embodied AI.
- Evolutionary Roadmap: It proposes a clear three-step roadmap for the evolution of embodied AI, moving from traditional robotics (model-driven) to hierarchical models (cognitive planning + action execution) and ultimately to end-to-end models (vision-language-action). This roadmap helps contextualize current and future research.
- Algorithm-Hardware Co-design Emphasis: It rigorously highlights and demonstrates the critical role of algorithm-hardware co-design as the primary methodology to address the performance-efficiency gap. It illustrates how inherent properties in robotics algorithms (e.g., parallelism, locality, sparsity, similarity) create opportunities for co-design, leading to enhanced performance and efficiency.
- Detailed Survey of Co-design Works: It extensively reviews recent works on both traditional robotic algorithms (perception, planning, action mapping, control) and embodied AI algorithms (Transformers, 3D reconstruction, diffusion models), showcasing specific examples of algorithm-hardware co-design techniques and their benefits.
- Future Challenges and Opportunities: It identifies two primary challenges for embodied AI computing systems: (1) adapting computing platforms to the rapid evolution of embodied AI algorithms, and (2) transforming the potential of emerging hardware innovations (e.g., 3D IC, Chiplet, CIM) into end-to-end inference improvements. It then proposes a three-layer technology stack (software toolchain, hardware architecture, algorithms) and corresponding opportunities for future research.
The key conclusion is that algorithm-hardware co-design is indispensable for realizing real-time, energy-efficient, and intelligent robotic computing systems and embodied AI, especially at the edge. The paper's findings help researchers understand the current state, major challenges, and promising directions for developing next-generation robotic platforms.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice reader should be familiar with the following foundational concepts:
- Robotic System Architecture:
  - Physical Entity ("Body"): The physical platform of a robot, which interacts with the physical world. Examples include manipulator arms, humanoid robots, quadrupeds, or drones.
  - Intelligent Agent ("Brain"): The computational and control core of the robot. It comprises three fundamental subsystems:
    - Sensory System: Collects data about the robot's internal state and the external environment using sensors like cameras, LiDARs (Light Detection and Ranging, a remote sensing method that uses pulsed laser light to measure ranges), etc.
    - Computing System: The focus of this paper. It integrates intelligent robotic algorithms and supporting hardware to process sensory data and dictate robotic actions.
    - Actuation System: Enables the robot's movement through components like servomotors, drives, and transmissions.
  - Degrees of Freedom (DOF): The number of independent parameters that define the configuration of a mechanical system. For robots, DOF often refers to the number of joints that can move independently.
- Algorithm-Hardware Co-design: A methodology that involves analyzing the computational characteristics of an algorithm (e.g., parallelism, data locality, sparsity, data similarity) and concurrently optimizing both the algorithm and the underlying hardware to achieve superior system-wide performance, power efficiency, and cost. It is an iterative process where algorithm optimizations inform hardware design, and hardware capabilities influence algorithm development.
- Artificial Intelligence (AI) / Machine Learning (ML) Models:
  - Convolutional Neural Networks (CNNs): A class of deep neural networks predominantly used for analyzing visual imagery. They employ convolutional layers (mathematical operations that apply a filter to an input to produce a feature map), pooling layers (down-sampling operations), and fully connected layers to learn hierarchical features.
  - Transformers: A neural network architecture introduced in 2017, primarily designed for sequence-to-sequence tasks but now widely applied to various domains including vision and robotics. Its key innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element.
  - Diffusion Models: A class of generative models that learn to reverse a gradual diffusion process (adding noise to data) to generate new data samples. They are powerful for generating high-quality images and are increasingly used for sequence generation tasks like robot actions.
  - Large Language Models (LLMs): Very large neural networks trained on massive text datasets, capable of understanding, generating, and processing human language. Examples include ChatGPT and PaLM.
  - Vision-Language Models (VLMs): Multimodal AI models that can process and understand information from both visual (images, video) and textual inputs. They can perform tasks like image captioning, visual question answering, and, in embodied AI, understand commands and perceive environments.
- Hardware Technologies:
  - Central Processing Unit (CPU): The "brain" of a computer, responsible for executing instructions and performing calculations. General-purpose but can be slow for highly parallel tasks.
  - Graphics Processing Unit (GPU): A specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the creation of images. Highly parallel architecture, excellent for deep learning computations, but can be power-intensive.
  - Field-Programmable Gate Array (FPGA): An integrated circuit that can be configured by a customer or designer after manufacturing. Offers high flexibility and energy efficiency for specific tasks, bridging the gap between general-purpose CPUs/GPUs and fixed-function ASICs.
  - Application-Specific Integrated Circuit (ASIC): An integrated circuit customized for a particular use rather than general-purpose use. Offers the highest performance and energy efficiency for its specific task but lacks flexibility.
  - System-on-Chip (SoC): An integrated circuit that integrates all components of a computer or other electronic system into a single chip. It typically includes CPU, GPU, memory, input/output ports, and specialized accelerators.
  - Compute-in-Memory (CIM): An emerging hardware paradigm that aims to overcome the Von Neumann bottleneck (the bottleneck between the processor and memory) by performing computation directly within or very close to memory, reducing data movement and improving energy efficiency.
  - 3D Integrated Circuit (3D IC): Stacking multiple silicon wafers or dies vertically and interconnecting them with Through-Silicon Vias (TSVs). This technology offers higher integration density, shorter interconnects, and higher bandwidth.
  - Chiplet: An approach to chip design where multiple small, specialized chips (chiplets) are integrated into a single package, connected by high-speed links. This allows for modular design, mix-and-match functionality, and potentially larger, more powerful systems than monolithic chips.
3.2. Previous Works
The paper extensively references prior studies across both traditional robotics and embodied AI, highlighting the progression towards more intelligent and efficient systems.
- Traditional Robotics Algorithms:
  - Perception:
    - Detection: RCNN series (two-stage, CNN-based) and YOLO series (single-stage, faster).
    - Segmentation: UNet (upsampling for map restoration) and DeepLab (dilated convolutions for multi-scale context).
    - Depth Estimation: Local methods (e.g., SAD) and global methods (e.g., Semi-Global Matching (SGM)) for stereo matching.
    - Localization: Simultaneous Localization and Mapping (SLAM) (constructs maps while localizing) and Visual-Inertial Odometry (VIO) (calculates relative poses using Kalman filtering).
  - Task Planning: Finite State Machines (FSMs) (for structured environments, e.g., DARPA Urban Challenge) and Partially Observable Markov Decision Process (POMDP) (for uncertainty, e.g., Hidden Markov Model (HMM) for intention prediction).
  - Motion Planning:
    - Graph-search-based: Dijkstra and A* (heuristic extension).
    - Sampling-based: Probabilistic Roadmap Method (PRM) and Rapidly-exploring Random Tree (RRT) and its variants like RRT* (outstanding efficiency in large environments).
  - Action Mapping:
    - Kinematics: Forward Kinematics (FK) and Inverse Kinematics (IK) (analytic methods, numerical methods like Jacobian inverse).
    - Dynamics: Inverse and Forward Dynamics (crucial for MPC, often using templating and code generation).
  - Control:
    - Feedback control without prediction: Proportional-Integral-Derivative (PID) control (simple, efficient).
    - Feedback control with prediction: Model Predictive Control (MPC) (predicts future behavior, optimizes control sequence).
    - Multi-task control: Whole-body control (closed-form methods, optimization methods like quadratic programming).
- Embodied AI Algorithms:
  - Cognitive Planning Models (leveraging LLMs/VLMs):
    - ChatGPT for Robotics (Microsoft): Integrates ChatGPT with robotic function libraries for natural language task decomposition.
    - Inner Monologue: Combines LLM-based planning with real-time environmental feedback for dynamic reasoning.
    - PaLM-E (Google): A VLM (540-billion-parameter PaLM + 22-billion-parameter ViT) trained end-to-end on robotic data for multimodal reasoning and long-horizon planning.
  - Action Execution Models (specialized for low-level control):
    - Pose Prediction (often involving 3D Reconstruction):
      - PointNet: Foundational for point cloud processing.
      - ASGrasp (Samsung): Utilizes a multi-level convolutional GRU network for transparent object reconstruction.
      - GraspNeRF and YOSO: Examples of NeRF-based and Transformer-based grasping models.
      - GaussianGrasper: Employs 3D Gaussian Splatting (3D GS) for scene reconstruction.
    - Action Generation (direct action output):
      - Autoregression-based (Transformers): RT-1 (Google, 35M parameters, EfficientNet-B3 backbone, action tokenization), ALOHA (Transformer encoder-decoder, 80M parameters, ResNet18 backbones).
      - Diffusion-based: Diffusion Policy (first to apply DDPMs to robot action space), UniP, AvDC (conditional diffusion for video generation), RDT-1B (diffusion foundation model, Transformer architecture).
  - Hierarchical Frameworks (combining cognitive planning and action execution):
    - SkillDiffuser: Uses GPT-2 for skill abstraction and a diffusion model for trajectory generation.
    - $\pi_0$ (Physical Intelligence): Pre-trained VLM for cognitive planning, diffusion-variant flow-matching model for action execution.
    - GO-1 (AgiBot): Pre-trained VLM, latent planner, action expert.
    - GR00T N1 (NVIDIA): VLM-based System 2, Diffusion Transformer Module-based System 1.
    - Helix (Figure): 7B-VLM-based System 2, 80M-Transformer-based System 1.
  - End-to-End Models (integrating all into one VLA):
    - RT-2 (Google): Integrates web-scale vision-language pretraining into robotic action generation (ViT to project images into language token space).
    - RT-X: Builds upon RT-2 using the Open X-Embodiment dataset for generalization.
    - OpenVLA (Stanford): State-of-the-art, open-source VLA (Prismatic-7B VLM backbone: ViT visual encoder, MLP projector, Llama2 backbone).
- Core Formulas (for prerequisite knowledge, especially Transformers): While the paper itself is a survey and does not introduce new formulas, understanding the self-attention mechanism is crucial for Transformers, which are foundational for many embodied AI models. The original Attention formula is:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  Where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequences, typically derived from the same input embedding through linear transformations.
  - $K^T$ is the transpose of the Key matrix.
  - $d_k$ is the dimension of the Key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor to prevent the dot products from growing too large, which can push the softmax function into regions with very small gradients.
  - $\frac{QK^T}{\sqrt{d_k}}$ computes the similarity scores between Queries and Keys, scaled.
  - $\mathrm{softmax}(\cdot)$ is the softmax function, which normalizes the scores to produce a probability distribution, indicating how much attention each Value should get.
  - Multiplying by $V$ produces the output, where each output element is a weighted sum of the Value vectors, with weights determined by the softmax scores.
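The attention formula above can be sketched in a few lines of NumPy. This is a minimal, single-head illustration without masking or learned projections; the function names `softmax` and `attention` and the matrix shapes are assumptions chosen for the example, not anything specified by the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)  # avoid overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                   # dimension of the Key vectors
    scores = Q @ K.T / np.sqrt(d_k)     # scaled Query-Key similarity scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # weighted sum of Value vectors

# Hypothetical shapes: 4 queries, 6 key/value pairs, d_k = 8, value dim = 16.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 16))
out, weights = attention(Q, K, V)
```

The score matrix has one entry per Query-Key pair, so its size grows with the product of the two sequence lengths.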
3.3. Technological Evolution
The paper illustrates a clear technological evolution in robotic computing systems, primarily driven by advancements in AI and hardware capabilities.
- Traditional Robotics (Pre-2022, Model-Driven):
  - Algorithms: Characterized by explicit, model-driven approaches. Perception relied on rule-based computer vision or early CNNs. Planning used classical graph-search-based and sampling-based methods. Control used PID or basic MPC. These methods often required precise models of the robot and environment, limiting their generality.
  - Hardware: Primarily relied on general-purpose CPUs and GPUs, with some early FPGA or ASIC acceleration for bottlenecks. The focus was on optimizing specific, well-defined algorithms.
- Emergence of Embodied AI - Hierarchical Models (2022-2023):
  - Motivation: Breakthroughs in Transformers and diffusion models (for long sequence/high-dimensional data) and LLMs/VLMs (for natural language understanding, multimodal perception, and long-horizon planning).
  - Architecture: A shift towards hierarchical structures.
    - Higher Level: Cognitive planning models (e.g., LLMs/VLMs like ChatGPT for Robotics, PaLM-E) handle high-level reasoning, task decomposition, and long-horizon planning based on natural language instructions and multimodal perception. These models are often large.
    - Lower Level: Action execution models (e.g., Transformer-based RT-1, ALOHA, or diffusion model-based Diffusion Policy) handle dexterous, low-level action control with faster inference speeds, trained on specific embodiment data.
  - Hardware Implications: Increased demand for specialized accelerators for Transformers and diffusion models, as well as efficient handling of the communication and frequency mismatch between high-level (slower, complex) and low-level (faster, real-time) components (e.g., Helix's hardware-aware training).
- Future of Embodied AI - End-to-End Models (2023-Present & Future):
  - Motivation: The ultimate goal is to integrate perception, planning, and action into a single, unified model for general-purpose robots and Artificial General Intelligence (AGI).
  - Architecture: Vision-Language-Action (VLA) models (e.g., RT-2, OpenVLA) that directly map multimodal inputs (vision, language) to robotic actions. These models are typically large, often built on pre-trained VLMs fine-tuned with robotic data.
  - Hardware Implications: Extreme computational and data throughput demands, pushing the limits of existing hardware and necessitating advanced algorithm-hardware co-design strategies, including emerging technologies like CIM, 3D IC, and Chiplets, to achieve real-time, energy-efficient performance at the edge.
The paper fits into this timeline by surveying the current state of co-design efforts at each stage, identifying the transition points, and forecasting future directions.
3.4. Differentiation Analysis
As a survey paper, its core differentiation lies not in proposing a new method, but in its comprehensive and unique perspective on the intersection of robotics, AI, and hardware.
- Algorithm-Hardware Co-design Focus: Unlike many surveys that focus solely on robotic algorithms or hardware advancements, this paper explicitly and consistently frames the discussion through the lens of algorithm-hardware co-design. It connects algorithmic properties (parallelism, sparsity, locality, similarity) to opportunities for hardware innovation, and vice versa. This integrated perspective is crucial for understanding the true challenges and solutions in high-performance, energy-efficient robotics.
- Evolutionary Roadmap for Embodied AI: The paper provides a structured roadmap for the evolution from traditional robotics to hierarchical embodied AI and end-to-end embodied AI. This framework helps organize the vast landscape of embodied AI research and clarifies the progression of complexity in both algorithms and hardware demands.
- Bridging Traditional and Embodied AI: The survey bridges the gap between traditional robotics techniques (e.g., classical motion planning, MPC, SLAM) and the cutting edge of embodied AI (Transformers, diffusion models, LLMs). It shows how co-design principles apply to both, and how the demands of the latter significantly amplify the need for specialized hardware.
- Detailed Hardware Acceleration Examples: For each category of algorithms (perception, planning, action, control, and embodied AI components like Transformers, 3D reconstruction, diffusion models), the paper provides concrete examples of hardware acceleration techniques and specialized architectures (e.g., CIM, FPGA designs, ASICs, SoCs). This level of detail makes it a valuable resource for hardware architects.
- Forward-Looking Perspective: Beyond summarizing existing work, the paper articulates clear future challenges and proposes a three-layer technology stack for addressing them, offering actionable directions for future research in software toolchains, hardware architectures, and algorithms.
In summary, this paper differentiates itself by offering an integrated, evolutionary, and forward-looking analysis of robotic computing systems through the critical paradigm of algorithm-hardware co-design, which is essential for realizing the full potential of embodied AI.
4. Methodology
This paper is a comprehensive survey, and as such, its methodology is primarily its structured approach to analyzing and presenting existing research. The core idea is to demonstrate how algorithm-hardware co-design is crucial for the evolution of robotic computing systems and embodied AI. This is achieved by systematically reviewing algorithms and their corresponding hardware acceleration techniques across different stages of robotic intelligence.
4.1. Principles
The fundamental principle guiding this survey is that achieving an appropriate balance among system metrics like accuracy, latency, and power consumption in robotic computing systems, especially with the rise of embodied AI, requires algorithm-hardware co-design. This methodology involves:
- Analyzing Algorithm Computational Behaviors: Identifying common properties within algorithms such as parallelism, locality, sparsity, and similarity.
- Algorithm Optimization: Introducing optimizations to algorithms to better leverage these inherent properties, thereby reducing computing and storage complexity and exposing opportunities for hardware support.
- Hardware Innovation: Developing specialized hardware from architecture to circuits that is tailored to exploit these algorithmic properties, leading to substantial system-wide benefits.
The paper argues that by aligning the strengths of both algorithm design and hardware engineering, algorithm-hardware co-design facilitates enhanced performance and efficiency for robotic computing systems.
4.2. Core Methodology In-depth (Layer by Layer)
The paper structures its analysis into several layers, moving from the foundational concepts of traditional robotics to the cutting edge of embodied AI, always emphasizing the algorithm-hardware co-design perspective.
4.2.1. Overall Algorithm-Hardware Co-design Framework
The paper introduces algorithm-hardware co-design as its central theme, illustrated in Figure 2 of the original paper.
The following figure (Figure 2 from the original paper) shows the algorithm-hardware co-design framework:
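The image is a schematic showing algorithm-hardware co-design in the evolution of robotic computing systems and embodied AI. It is divided into three parts (robotic algorithms, algorithm optimization, and hardware innovation) and highlights the roles of parallelism, data locality, sparsity, and data similarity in co-design.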
This diagram illustrates that robotic algorithms inherently exhibit common computational behaviors such as parallelism, locality, sparsity, and similarity. These properties create opportunities for:
- Algorithm Optimization: Modifying algorithms to better exploit these properties (e.g., reducing compute and storage complexity, exposing hardware support opportunities).
- Hardware Innovation: Designing hardware architectures and circuits that are specialized to efficiently execute these optimized algorithms (e.g., accelerators, CIM).
The interplay between these two domains leads to substantial system-wide benefits (e.g., improved performance, energy efficiency).
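As a rough illustration of how one of these properties, sparsity, translates into saved work, the sketch below counts multiply-accumulate (MAC) operations for a dense matrix-vector product and for a version that skips zero entries. The function names, the 32x32 size, and the 90% zero ratio are assumptions made for the example; they do not come from the paper, and real accelerators exploit sparsity in hardware rather than in Python loops.

```python
import numpy as np

def dense_matvec(A, x):
    """Dense product: performs one MAC for every matrix entry."""
    rows, cols = A.shape
    y = np.zeros(rows)
    macs = 0
    for i in range(rows):
        for j in range(cols):
            y[i] += A[i, j] * x[j]
            macs += 1
    return y, macs

def sparse_matvec(A, x):
    """Sparsity-aware product: performs one MAC per nonzero entry only."""
    rows, _ = A.shape
    y = np.zeros(rows)
    macs = 0
    for i in range(rows):
        for j in np.nonzero(A[i])[0]:  # column indices of nonzeros in row i
            y[i] += A[i, j] * x[j]
            macs += 1
    return y, macs

# Hypothetical workload: a 32x32 matrix with roughly 90% of entries zeroed.
rng = np.random.default_rng(0)
A = rng.standard_normal((32, 32))
A[rng.random((32, 32)) < 0.9] = 0.0
x = rng.standard_normal(32)

y_dense, macs_dense = dense_matvec(A, x)
y_sparse, macs_sparse = sparse_matvec(A, x)
```

Both versions return the same result, while the MAC count of the sparse version equals the number of nonzeros. This is the kind of algorithmic property a co-designed accelerator can be built to exploit.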
4.2.2. Traditional Robotic System Architecture
The paper first deconstructs a typical robotic system into its functional components, as shown in Figure 3 of the original paper.
The following figure (Figure 3 from the original paper) shows the overall architecture of a robotic computing and actuation system:
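The image is a schematic showing the overall architecture of the robotic computing and actuation systems, including the cooperative workflow of the sensory system, task planning, motion planning, action mapping, and control modules, reflecting closed-loop control from sensor information to executed actions.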
This architecture comprises:
- Sensory System: Gathers data from the internal state of the robot and the external environment (e.g., cameras, LiDARs).
- Computing System: The "brain" that integrates algorithms and hardware. It processes sensory data to make intelligent decisions and translate them into actions. This system is broken down into five functional tasks:
  - Perception: Processes sensory data to extract high-level information (e.g., objects, positions, maps).
  - Task Planning: Decides what actions the robot should take based on environmental information.
  - Motion Planning: Determines how to execute actions by generating collision-free trajectories.
  - Action Mapping: Translates planned paths/trajectories into physical robot actions using robot mechanism models (kinematics, dynamics).
  - Control: Sends command sequences or signals to actuators to ensure the robot follows reference actions in a feedback loop.
- Actuation System: Executes the physical movements of the robot.
Each of these stages demands specialized algorithms and computing hardware for high accuracy, real-time performance, and energy efficiency.
4.2.3. Embodied AI Evolution Roadmap
The paper then traces the evolution of robotic computing systems and embodied AI through a three-step roadmap, depicted in Figure 1 and Figure 4 of the original paper.
The following figure (Figure 4 from the original paper) shows the evolution from traditional robotics to end-to-end embodied AI:
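The image is a schematic showing the evolution of cognitive planning and action execution models from traditional robotics to end-to-end embodied AI, together with the algorithm-hardware co-design framework.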
The roadmap outlines the transition:
- Traditional Robotics (Model-Driven): Relies on predefined models and explicit programming for specific tasks.
- Hierarchical Model (Cognitive Planning + Action Execution): A transitional stage where embodied AI models are introduced.
  - Cognitive Planning Model: At a higher level, uses LLMs or VLMs for understanding, reasoning, and long-horizon planning. These models are often large and might operate at a lower frequency.
  - Action Execution Model: At a lower level, uses specialized models (e.g., Transformers, diffusion models) for dexterous action control, operating at a higher frequency.
- End-to-End Model (Vision-Language-Action - VLA): The ultimate goal, where perception, planning, and action are seamlessly integrated into a single model, enabling general-purpose tasks in real time with high intelligence.
This roadmap serves as the organizational backbone for discussing embodied AI algorithms and their hardware implications.
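The frequency split in the hierarchical stage can be sketched as a two-rate loop: a slow planner refreshes a subgoal every few ticks, and a fast executor issues a command on every tick. This is a toy illustration only. `cognitive_planner`, `action_executor`, `run_hierarchy`, and the `plan_every` parameter are hypothetical stand-ins, not interfaces from the paper or from any of the models it surveys.

```python
def cognitive_planner(instruction, step):
    """Stand-in for a slow LLM/VLM planner: returns the next subgoal."""
    return f"{instruction}:subgoal-{step}"

def action_executor(subgoal, observation):
    """Stand-in for a fast action model: returns a low-level command."""
    return (subgoal, observation)

def run_hierarchy(instruction, ticks, plan_every):
    """Run the executor every tick and the planner every `plan_every` ticks."""
    planner_calls = 0
    commands = []
    subgoal = None
    for t in range(ticks):
        if t % plan_every == 0:
            # Low-frequency path: refresh the subgoal.
            subgoal = cognitive_planner(instruction, planner_calls)
            planner_calls += 1
        # High-frequency path: act on the most recent subgoal.
        commands.append(action_executor(subgoal, observation=t))
    return planner_calls, commands

planner_calls, commands = run_hierarchy("pick up the cup", ticks=100, plan_every=20)
```

With these assumed settings the executor runs 100 times and the planner 5 times, which makes the rate mismatch between the two levels explicit.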
4.2.4. Detailed Analysis of Traditional Robotics Algorithms and Hardware (Section 2)
The paper dedicates a significant portion to detailing the algorithms and their hardware accelerators for each of the five functional tasks in traditional robotics.
4.2.4.1. Perception
- Detection:
- Algorithms:
Two-stage methods(e.g.,RCNNseries) prioritize accuracy by first generatingregion proposalsand then classifying.Single-stage methods(e.g.,YOLOseries) prioritize speed by directly predicting categories and bounding boxes. - Hardware Co-design: Accelerators optimize
multiply-accumulate (MAC)operations within sensors (SoCs) or reduce computational overhead formulti-scale semantic feature extraction (MSFE)through parallel processing and data compression.
- Algorithms:
- Segmentation:
- Algorithms:
Semantic segmentationlabels regions by class,instance segmentationdistinguishes multiple objects of the same class.UNetandDeepLabare widely used, employing upsampling and dilated convolutions, respectively. - Hardware Co-design: Accelerators focus on reducing data intensity and pixel-level computations. Examples include low-energy imaging devices with
analog background subtractionandcompute-in-memory (CIM)architectures forfloating-pointoperations, reducing bandwidth and area.
- Algorithms:
- Depth Estimation:
- Algorithms: Estimates environmental depth from
stereo imagesby calculating disparities usingstereo matching algorithms.Local methods(e.g.,SAD) trade accuracy for speed;Global methods(e.g.,Semi-Global Matching (SGM)) are more accurate but time-consuming. - Hardware Co-design: Accelerating
stereo matchingis critical.SADcan be implemented onFPGAs with efficientline buffering.SGMprocessors usehigh-throughput pipelined architectureswith dependency-resolving schemes andcustomized ultra-high bandwidth SRAMto exploit unique data behaviors.
- Algorithms: Estimates environmental depth from
- Localization:
  - Algorithms: Estimates robot pose (position and orientation). SLAM constructs maps while localizing, often formulated as constrained nonlinear optimization. VIO (and other odometry methods) calculate relative poses, typically using Kalman filtering, without constructing explicit maps.
  - Hardware Co-design: For mobile robots, energy-efficient FPGA accelerators for SLAM backend optimization exploit data sparsity, locality, similarity, and pipeline opportunities. Fully integrated VIO accelerators on a single chip eliminate energy-consuming data transfers by using compression, sparsity, rescheduling, and parallelism. Unified localization algorithm frameworks identify common kernels and co-design front-end and back-end accelerators with task parallel processing, data locality memory schemes, and workload schedulers. Frameworks like Archytas automate accelerator generation from algorithm descriptions.
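The local stereo matching described under Depth Estimation can be illustrated with a brute-force SAD (sum of absolute differences) block matcher. This is a minimal sketch, not any cited accelerator's design: the function name, window size, and disparity range are assumptions, and it expects rectified grayscale images where a left-image pixel at column `x` appears at column `x - d` in the right image.

```python
import numpy as np

def sad_disparity(left, right, max_disp=8, window=3):
    """Per-pixel disparity by minimizing the sum of absolute differences (SAD)."""
    h, w = left.shape
    half = window // 2
    left = left.astype(np.float64)
    right = right.astype(np.float64)
    disparity = np.zeros((h, w), dtype=np.int64)
    for y in range(half, h - half):
        for x in range(half, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            # Only test shifts that keep the candidate window inside the image.
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1]
                cost = np.abs(patch - cand).sum()
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disparity[y, x] = best_d
    return disparity
```

Every pixel repeats the same small window comparison over neighboring rows, which is the regular access pattern that line buffering on an FPGA is meant to serve.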
4.2.4.2. Task Planning
- Algorithms: Determines a sequence of future motions. Initially human-expert driven, then Finite State Machines (FSMs) scheduled motions based on perception (e.g., DARPA Urban Challenge). Partially Observable Markov Decision Process (POMDP) and Hidden Markov Model (HMM) handle uncertainty and dynamic changes.
- Hardware Co-design: Historically on microcontroller units (MCUs). Increasing intelligence demands higher computing power, leading to Systems-on-Chip (SoCs) (e.g., Tesla's Full Self-Driving (FSD) computer, which integrates CPUs, ISPs, GPUs, and custom neural network accelerators).
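An FSM-based task planner of the kind mentioned above can be sketched as a transition table keyed by (state, perception event). The state and event names below are hypothetical and chosen only for illustration; they are not taken from the paper or from any DARPA Urban Challenge system.

```python
# Hypothetical driving behaviors; unknown (state, event) pairs keep the current state.
TRANSITIONS = {
    ("lane_follow", "obstacle_ahead"): "stop",
    ("stop", "path_clear"): "lane_follow",
    ("lane_follow", "intersection"): "wait_at_intersection",
    ("wait_at_intersection", "path_clear"): "lane_follow",
}

def step(state, event):
    """Return the next behavior state for a perception event."""
    return TRANSITIONS.get((state, event), state)

def run(events, state="lane_follow"):
    """Replay a sequence of perception events and record the visited states."""
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace
```

A table lookup like this is cheap enough for a microcontroller, which is consistent with the note that task planning historically ran on MCUs.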
4.2.4.3. Motion Planning
- Algorithms: Generates collision-free trajectories.
  - Global Planning: Finds shortest collision-free paths. Graph-search-based methods (e.g., A*) traverse state spaces serially. Sampling-based methods (e.g., PRM, RRT) find trajectories by probabilistic sampling, which is faster for high-dimensional spaces.
  - Local Planning: Refines trajectories to satisfy constraints (safety, physical limits), often formulated as optimization problems. Numerical methods, distributed optimization (e.g., ADMM), and data-driven methods are used.
- Hardware Co-design:
  - Global Planning: Accelerators exploit similarity patterns in A* (RACOD) to predict future poses. PRM accelerators (Dadu) adopt hardware-friendly graph representations. RRT is optimized with parallel schemes and prune-and-reuse strategies for dynamic environments.
  - Local Planning: GPU-based optimization solvers leverage data-independent and data-dependent parallelism for matrix operations. Factor graph accelerators (BLITZCRANK) exploit sparsity and incremental solving. Hardware generation frameworks (ORIANNA) automatically create customized accelerators.
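Graph-search-based global planning expands one state at a time from a priority queue, which is the serial traversal noted above. The sketch below uses A* on a 4-connected grid with a Manhattan-distance heuristic; A* is chosen here simply as a familiar graph-search method, and the grid encoding (`#` for obstacles) and function name are illustrative assumptions.

```python
import heapq

def astar(grid, start, goal):
    """Shortest 4-connected path on a grid of strings; '#' marks an obstacle.

    Returns the list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]   # entries are (f, g, cell)
    g_cost = {start: 0}
    parent = {start: None}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if g > g_cost[cur]:
            continue                             # stale queue entry
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_g = g + 1
                if new_g < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = new_g
                    parent[nxt] = cur
                    heapq.heappush(open_heap, (new_g + heuristic(nxt), new_g, nxt))
    return None
```

Each loop iteration depends on the queue state left by the previous one, which is why accelerators for this class of planner look for ways to predict upcoming expansions rather than simply adding parallel lanes.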
4.2.4.4. Action Mapping
- Algorithms: Translates mathematical paths into physical robot actions using robot mechanism models.
  - Kinematics: Describes motion without forces. Forward Kinematics (FK) calculates end-effector poses from joint variables. Inverse Kinematics (IK) is the inverse problem and is often more complex (analytic methods for low DOF, numerical methods like Jacobian inverse for high DOF).
  - Dynamics: Describes the relationship between forces/torques and motion/acceleration. Inverse and Forward Dynamics (and their gradients) are key kernels in MPC.
- Hardware Co-design:
  - IK: FPGA implementations use data type reduction and module sharing. Accelerators (Dadu) use speculation-based Jacobian transpose algorithms and exploit algorithmic parallelism.
  - Dynamics: Robomorphic computing designs accelerators parameterized by robot morphology, exploiting parallelism and matrix sparsity. RoboShape uses topology patterns for scalable accelerators. Dadu-rbd provides multifunctional pipelines to optimize data locality and adapt to robot-specific sparsity.
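Forward kinematics and a numerical IK iteration can be sketched for a planar two-link arm. This is a plain Jacobian transpose update, not the speculation-based variant attributed to Dadu; the link lengths, step size, and function names are assumptions made for the example.

```python
import numpy as np

def forward_kinematics(q, l1=1.0, l2=1.0):
    """End-effector (x, y) of a planar 2-link arm with joint angles q."""
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q, l1=1.0, l2=1.0):
    """Partial derivatives of the end-effector position w.r.t. the joints."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [l1 * c1 + l2 * c12, l2 * c12]])

def ik_jacobian_transpose(target, q0, step=0.1, iters=2000, tol=1e-6):
    """Iterate q <- q + step * J^T * error until the pose error is small."""
    q = np.array(q0, dtype=float)
    for _ in range(iters):
        error = target - forward_kinematics(q)
        if np.linalg.norm(error) < tol:
            break
        q += step * jacobian(q).T @ error
    return q
```

The transpose update avoids a matrix inverse at the cost of more iterations, a trade that tends to suit simple, parallel hardware.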
4.2.4.5. Control
- Algorithms: Ensures robots follow reference actions with feedback.
  - Without Prediction: PID control (simple, efficient).
  - With Prediction: MPC (predicts future behavior, optimizes control commands).
  - Multi-task Control: Whole-body control for redundant DOFs (closed-form methods using pseudo-inverse matrices, or optimization methods like quadratic programming).
- Hardware Co-design: For high control rates, hardware acceleration is necessary. Numerous works accelerate PID. MPC accelerators have been developed on FPGAs (for quadratic programming solvers) and ASICs (leveraging the parallel and sparse nature of the problem with pruning strategies and physical model transformation).
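The prediction-free control loop mentioned above can be sketched as a textbook discrete PID controller. The gains, the time step, and the first-order plant `dx/dt = -x + u` are made-up values for illustration only, and the sketch omits anti-windup and derivative filtering.

```python
class PID:
    """Textbook discrete PID controller (no anti-windup, no derivative filter)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def simulate(pid, setpoint=1.0, steps=2000):
    """Drive a hypothetical first-order plant dx/dt = -x + u with forward Euler."""
    x = 0.0
    for _ in range(steps):
        u = pid.update(setpoint, x)
        x += pid.dt * (-x + u)
    return x

final_state = simulate(PID(kp=2.0, ki=2.0, kd=0.01, dt=0.01))
```

Each update is a handful of multiply-adds, which is why PID is considered simple and efficient; MPC instead solves an optimization problem at every control step.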
4.2.5. Detailed Analysis of Embodied AI Algorithms and Hardware (Section 3)
The paper then transitions to the embodied AI era, detailing the algorithms and hardware acceleration for hierarchical and end-to-end models.
4.2.5.1. Hierarchical Model
- Cognitive Planning Model:
  - Algorithms: LLMs (e.g., ChatGPT for Robotics, Inner Monologue) for long-horizon task decomposition and reasoning from textual descriptions. VLMs (e.g., PaLM-E, Figure 5) integrate visual and language inputs for multimodal reasoning.
  - Hardware Co-design: These models are large (billions of parameters), requiring efficient execution. The paper implicitly notes that Transformer accelerators (discussed later) are crucial here. The following figure (Figure 5 from the original paper) shows the PaLM-E model architecture overview:
    The image is a schematic showing the pipeline for task question answering using visual embeddings (emb), a Vision Transformer (ViT), and a large language model (PaLM), illustrating the information flow in algorithm-hardware co-design.
- Action Execution Model:
  - Algorithms: Executes low-level actions.
    - Pose Prediction: Focuses on finding optimal final poses, often integrating 3D reconstruction (point cloud, NeRF, 3D GS). Examples include ASGrasp (using GSNet), AnyGrasp, GaussianGrasper, and YOSO (using Point Transformer).
    - Action Generation: Directly outputs actions or action sequences. Autoregression-based models use Transformers (e.g., RT-1, ALOHA, Figure 6) for sequence modeling. Diffusion-based models (e.g., Diffusion Policy, Figure 7, UniP, AvDC, RDT-1B) represent action sequence generation as a conditional denoising diffusion process.
  - Hardware Co-design: Requires faster inference. 3D reconstruction accelerators and diffusion model accelerators (discussed later) are critical. The following figure (Figure 6 from the original paper) shows the Action Chunking with Transformers (ACT) model architecture:
    The image is a diagram showing the Action Chunking with Transformers model architecture. The left side is the data input part, including images from different cameras and feature extraction; the right side is the Transformer decoder, which processes the action sequence and position embeddings.
    The following figure (Figure 7 from the original paper) illustrates the general formulation of the diffusion policy:
    The image is a schematic showing the general formulation of the diffusion policy and its concrete CNN-based and Transformer-based implementations, involving modules such as observation inputs, action sequences, and conditional embeddings, with a formula illustrating the convolution operation.
- Hierarchical Frameworks:
  - Algorithms: Combine cognitive planning and action execution models (e.g., SkillDiffuser, GO-1, GR00T N1 (Figure 8), Helix). These frameworks aim to leverage the strengths of both, handling complex tasks with multimodal perception, long-horizon planning, and dexterous actions.
  - Hardware Co-design: Helix demonstrates hardware-aware training to mitigate temporal misalignment between high-frequency and low-frequency components. This highlights the ongoing need for dedicated co-designed accelerators. The following figure (Figure 8 from the original paper) shows the GR00T N1 model architecture overview:
    The image is a schematic showing the architecture of the GR00T N1 model, which combines image observations, language instructions, and robot state to execute actions. It shows the encoding of images and text and the action tokens generated by a diffusion transformer, which ultimately control the robot's actions.
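The conditional denoising process behind diffusion-based action generation can be sketched as a DDPM-style reverse loop. This is a minimal illustration under stated assumptions, not the Diffusion Policy implementation: the linear noise schedule, the function names, and the `make_oracle` noise predictor (a stand-in for a trained, observation-conditioned network that cheats by knowing the clean target actions) are all hypothetical.

```python
import numpy as np

def sample_actions(predict_noise, obs, shape, betas, seed=0):
    """DDPM-style reverse process: refine Gaussian noise into an action sequence.

    predict_noise(x, t, obs) plays the role of the learned denoiser; the
    caller supplies it, so nothing here depends on a specific network.
    """
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                 # start from pure noise
    for t in reversed(range(len(betas))):          # one denoiser call per step
        eps = predict_noise(x, t, obs)
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[t]) * noise
    return x

def make_oracle(target, betas):
    """Hypothetical stand-in for a trained network: it knows the clean target."""
    alpha_bars = np.cumprod(1.0 - betas)

    def predict_noise(x, t, obs):
        return (x - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])

    return predict_noise

betas = np.linspace(1e-4, 0.02, 50)                # assumed linear schedule
target = np.array([[0.1, -0.2], [0.3, 0.4], [0.0, 0.5]])  # made-up 3-step, 2-joint actions
actions = sample_actions(make_oracle(target, betas), obs=None,
                         shape=target.shape, betas=betas)
```

The loop calls the denoiser once per step, so inference cost grows with the number of denoising steps; this is the property that makes faster inference hardware relevant for this model family.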
4.2.5.2. End-to-End Model
- Algorithms: Aims to integrate perception, planning, and action into a single model. Vision-Language-Action (VLA) models (e.g., RT-2, RT-X, OpenVLA (Figure 9)) process multimodal inputs to generate robotic actions, showing capabilities in long-horizon planning and reasoning.
- Hardware Co-design: These are highly resource-intensive due to their integrated nature and often large parameter counts. The following figure (Figure 9 from the original paper) shows the OpenVLA model architecture overview:
  The image is a schematic giving an overview of the OpenVLA model architecture. The top left shows the input image and the language instruction describing the robot task (e.g., "put eggplant in bowl"). Data flows through an MLP projector and several modules (DinoV2 and SigLIP) and finally reaches the core containing Llama 2 7B and the Action De-Tokenizer. The right side shows the 7D action representation executed by the robot, including position, rotation, and gripper parameters.
4.2.5.3. Embodied AI Hardware
The paper identifies Transformers, 3D reconstruction, and diffusion models as primary computational bottlenecks across diverse embodied AI algorithms, driving the need for specialized accelerators.
- Transformer Accelerator:
  - Challenges: The self-attention mechanism has quadratic computational and memory complexity, especially for LLMs and high-resolution images.
  - Hardware Co-design:
    - Digital Accelerators: Exploit sparsity patterns and data redundancy (Wang et al., Tambe et al. with entropy-based early exit, Kim et al. with C-Transformer and implicit weight generation) to reduce computation, memory access, and power.
    - CIM Accelerators: Integrate computation directly within memory. Examples include Tu et al. (reconfigurable streaming network, bitline-transpose CIM, sparse attention scheduler) and Guo et al. (hybrid analog-digital lightning-like CIM with compressed adder tree and analog-storage quantizer).
- 3D Reconstruction Accelerator:
  - Challenges: Intensive spatial computation and irregular memory access.
  - Hardware Co-design: Uses approximate computing and dataflow transformation.
    - Point Cloud Accelerators: Address irregular/sparse data. Examples include Im et al. (ToF sensor, window-based technique for conjugate gradient, feature reuse), Sun et al. (reconfigurable sparse convolution core, neighbor search circuit), and Jung et al. (PRNG for neighbor search).
    - NeRF Reconstruction Accelerators: Address intensive MLP calculations and memory access. Examples include Han et al. (MetaVRain with spatial, temporal, top-down attention, periodic polynomial for positional encoding), Ryu et al. (NeuGPU with on-chip hash tables, hybrid interpolation, similarity sparsity), and Park et al. (Space-Mate for NeRF-SLAM with out-of-order SMoE router and heterogeneous core).
    - 3D GS Reconstruction Accelerators: Address sorting and rasterization bottlenecks. Examples include Lee et al. (GSCore with Gaussian shape-aware intersection test, two-stage hierarchical sorting) and Wu et al. (GauSPU for 3D GS and SLAM using sparse-tile-sampling).
- Diffusion Model Accelerator:
  - Challenges: The iterative denoising process leads to a linear increase in computational complexity and memory access.
  - Hardware Co-design: Focuses on optimizing dataflow across adjacent iterations using differential computation.
    - Digital Accelerators: Kong et al. (Cambricon-D) optimize for nonlinear operators by using sign-mask dataflow and outlier-aware processing elements.
    - CIM Accelerators: Guo et al. divide input variations into dense integer and sparse floating-point parts. They use radix-8 Booth CIM for the integer part and a reconfigurable 4-operand exponent CIM (4Op-ECIM) for the floating-point part, which is also configurable as a CAM for sparse value skipping.
This layered and detailed breakdown, consistently linking algorithms to their hardware acceleration strategies and underlying computational properties, forms the core methodology of the survey.
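To make the quadratic self-attention cost cited under the Transformer accelerator challenges concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The function name, the projection matrices, and the sizes are illustrative assumptions, not taken from the paper or from any cited accelerator.

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention.

    x has shape (n_tokens, d_model). The score matrix has shape
    (n_tokens, n_tokens), so its size grows quadratically with sequence length.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v, weights
```

Doubling the number of tokens quadruples the number of attention weights that must be computed and stored, which is the motivation for the sparsity-based and compute-in-memory designs listed above.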
5. Experimental Setup
As a survey paper, "Robotic computing system and embodied AI evolution: an algorithm-hardware co-design perspective" does not present its own experimental setup, datasets, evaluation metrics, or baselines. Instead, it synthesizes and analyzes the experimental results and methodologies reported in hundreds of other research papers.
The paper's "experimental setup" is, in essence, its comprehensive literature review and structured analysis framework. It draws upon the diverse experimental methodologies from the referenced works to illustrate trends, challenges, and successful algorithm-hardware co-design implementations.
5.1. Datasets
The paper does not introduce or use its own datasets. It discusses various datasets used by the referenced papers, which are integral to understanding the context of the algorithms and hardware being reviewed. For instance:
- RT-X (a referenced VLA model) builds upon the Open X-Embodiment dataset, a collaborative dataset aggregating data from 22 distinct robotic platforms across 21 institutions, encompassing 527 skills and 160,266 tasks. This dataset aims to enhance generalization across various robotic platforms and scenarios.
- ALOHA (an autoregression-based action generation model) learns from expert demonstrations of robot actions collected through teleoperation.
- PaLM-E (a VLM) is trained end-to-end on multiple robotic tasks, leveraging VLM pretraining with fine-tuning on robotic data.
- RDT-1B (a diffusion foundation model) is pretrained on a large-scale multi-robot dataset and fine-tuned on a self-created multi-task bimanual dataset.
- Helix (a hierarchical framework) is trained on substantial teleoperation data.
The choice of these diverse datasets by the original researchers is effective for validating the performance of their methods across different robotic tasks, modalities (visual, language, proprioceptive), and generalization capabilities.
5.2. Evaluation Metrics
The paper does not define or use its own evaluation metrics. It refers to the performance metrics discussed or reported in the cited works to highlight the effectiveness of algorithm-hardware co-design. Common metrics implicitly referenced or mentioned in the context of acceleration include:
- Accuracy: Refers to the correctness of task execution or prediction (e.g., detection accuracy, segmentation accuracy, grasping success rate).
- Latency: The delay between an input and a corresponding output (e.g., inference time, control loop frequency in Hz or fps, convergence time for planning algorithms). Lower latency is generally better, especially for real-time robotic control.
- Power Consumption: The amount of electrical power consumed by the hardware (e.g., in watts, joules/token, mW). Lower power consumption is critical for mobile, battery-operated robots.
- Energy Efficiency: The ratio of computational performance to power consumption (e.g., TFLOPS/W, tera floating-point operations per second per watt), or the energy spent per unit of work (e.g., µJ/token, microjoules per token). Higher TFLOPS/W is better; for µJ/token, lower is better.
- Throughput: The amount of work done in a unit of time (e.g., fps for vision tasks, tasks/second).
- Resource Utilization: How efficiently hardware resources are used.
While the paper doesn't provide the mathematical formulas for these general metrics, their conceptual definitions are standard in computer science and engineering. For example:
- Accuracy (General Definition): The degree to which the result of a measurement, calculation, or specification conforms to the correct or true value. In classification tasks, it is often defined as the ratio of correctly predicted instances to the total number of instances.
  $ \mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Latency (General Definition): The time delay between the cause and effect of some physical change in the system being observed. In computing, it is often the time taken for a system to respond to an input.
  - No single universal formula; it's a direct time measurement.
- Power Consumption (General Definition): The rate at which electrical energy is transferred, used, or dissipated.
  - No single universal formula; it's a direct measurement, often computed as $ P = V \times I $ (voltage times current) for electrical power.
- Energy Efficiency (TFLOPS/W): A common metric for computational accelerators, representing the number of tera floating-point operations per second (TFLOPS) performed per watt of power consumed.
  $ \mathrm{Energy\ Efficiency\ (TFLOPS/W)} = \frac{\text{TFLOPS}}{\text{Power Consumption (W)}} $
  - TFLOPS: Tera Floating-point Operations Per Second (operations performed per second).
  - Power Consumption (W): Power consumed by the hardware in Watts.
- Energy Efficiency (µJ/token): A metric specific to LLMs or Transformer inference, representing the energy consumed per processed token.
  $ \mathrm{Energy\ Efficiency\ (\mu J/token)} = \frac{\text{Total Energy (µJ)}}{\text{Number of Tokens Processed}} $
  - Total Energy (µJ): Total energy consumed in microjoules.
  - Number of Tokens Processed: The total count of input/output tokens processed by the model.
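The two energy-efficiency formulas above reduce to simple arithmetic, sketched here with hypothetical numbers. The helper names and the example values (2 TFLOPS at 4 W; 0.5 W for 2 s over 100,000 tokens) are made up for illustration and do not come from any cited chip.

```python
def tflops_per_watt(flops_per_second, power_watts):
    """Throughput-style efficiency: higher is better."""
    return flops_per_second / 1e12 / power_watts

def microjoules_per_token(power_watts, seconds, tokens):
    """Energy-per-work efficiency: lower is better."""
    energy_uj = power_watts * seconds * 1e6   # joules converted to microjoules
    return energy_uj / tokens

# Hypothetical accelerator figures, for illustration only.
efficiency = tflops_per_watt(2e12, 4.0)              # 0.5 TFLOPS/W
energy = microjoules_per_token(0.5, 2.0, 100_000)    # 10 µJ/token
```

Because the two metrics point in opposite directions (higher versus lower is better), comparisons across papers need to state which one is being quoted.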
5.3. Baselines
The paper does not define its own baselines. However, it implicitly refers to baselines used in the surveyed literature. These generally include:
- General-purpose hardware: CPUs and GPUs are often used as baselines to demonstrate the performance-efficiency gap that specialized accelerators aim to address. For instance, execution on CPUs is compared to GPU implementations, and GPU power consumption is contrasted with that of specialized ASICs.
- Previous algorithmic approaches: When new algorithms are discussed (e.g., Transformer-based RT-1 or diffusion-based Diffusion Policy), they are implicitly compared to earlier, less intelligent or less efficient methods (e.g., PID control, classical motion planning).
- Non-accelerated versions: For hardware acceleration works, the non-accelerated software implementation of the same algorithm on a standard platform (e.g., FPGA vs. CPU implementation of SLAM backend optimization) serves as a baseline.
- Alternative acceleration techniques: Different hardware acceleration strategies (e.g., digital accelerators vs. CIM-based accelerators for Transformers) are compared against each other.
- Simpler or smaller models: For complex embodied AI models, simpler or smaller models, or those without specific co-design optimizations, serve as baselines to demonstrate the improvements from algorithmic advancements or algorithm-hardware co-design. For example, GR00T N1, among others, is stated to outperform "baseline models" in various tasks.
The representativeness of these baselines is determined by the original authors of the referenced papers, who typically choose the most relevant and state-of-the-art comparisons for their specific contributions.
6. Results & Analysis
As a comprehensive survey paper, this work does not present original experimental results in the form of new data from its own experiments. Instead, its "results" are the synthesis and analysis of findings from the vast body of literature it reviews. The paper's core contribution in this section is to highlight the consistent benefits demonstrated by algorithm-hardware co-design across various robotic domains and embodied AI applications.
6.1. Core Results Analysis
The paper systematically analyzes the effectiveness of algorithm-hardware co-design by summarizing the performance and efficiency gains achieved by various specialized hardware accelerators for traditional robotics and embodied AI algorithms.
- Addressing the Performance-Efficiency Gap: The paper consistently shows that general-purpose hardware (CPUs, GPUs) often fails to meet the stringent real-time and energy-efficiency requirements of robotics. For example, execution on CPUs can take several seconds, making it impractical, while GPUs consume hundreds of watts, precluding their use in power-constrained mobile robots. This gap motivates the co-design approach.
- Traditional Robotics - Specific Gains:
  - Perception:
    - Detection: Convolutional imaging SoCs can achieve 0.2-to-3.6 TOPS/W (tera operations per second per watt) for feature extraction.
    - Segmentation: Low-energy imagers with in-sensor event detection achieve 2.9 pJ/pixel×frame (picojoules per pixel per frame). CIM-based architectures reduce bandwidth and area.
    - Depth Estimation: A 1920x1080, 30 fps, 2.3 TOPS/W stereo-depth processor for SGM demonstrates high performance and energy efficiency.
    - Localization: A 2-mW fully integrated VIO accelerator (Navion) eliminates massive energy-consuming data transfer. FPGA accelerators leverage data sparsity, locality, similarity, and pipeline opportunities for energy efficiency.
  - Task Planning: Tesla's FSD computer, an SoC with custom neural network accelerators, achieves impressive performance at low power consumption.
  - Motion Planning: RACOD (for A*) predicts future poses and performs collision checks to improve speed. Dadu (for PRM) adopts a hardware-friendly graph representation for scalability. A 1.5-µJ/task path-planning processor for RRT demonstrates energy efficiency in micro-robots.
  - Action Mapping: Accelerators for IK (e.g., Dadu) exploit algorithmic parallelism for real-time performance and high energy efficiency in high-DOF applications. Robomorphic computing and RoboShape leverage robot morphology and topology for scalable and flexible accelerators.
  - Control: FPGA accelerators for MPC improve quadratic programming solver speed. A 28nm 142mW motion-control SoC achieves high performance for autonomous mobile robots.
- Embodied AI - Specific Gains:
  - Transformer Accelerators:
    - Digital accelerators (Wang et al.) achieve 27.5 TOPS/W with approximate computing and sparsity speculation. Tambe et al.'s processor achieves 18.1 TFLOPS/W with entropy-based early exit and mixed precision. C-Transformer (Kim et al.) achieves 2.6–18.1 µJ/token for LLMs on mobile devices.
    - CIM-based accelerators (Tu et al.) achieve 15.59 µJ/token for sparse Transformers by minimizing external memory accesses and integrating transposition. MulTCIM (Tu et al.) reaches 2.24 µJ/token for multimodal Transformers. A lightning-like hybrid CIM macro achieves high energy efficiency for Transformers and CNNs.
  - 3D Reconstruction Accelerators:
    - Point Cloud: DSPU reduces power by more than 60% for depth processing.
    - NeRF: MetaVRain achieves 133mW real-time NeRF processing, reducing computation by over 95%. NeuGPU achieves a 345s modeling time (faster than edge GPUs). Space-Mate for NeRF-SLAM consumes 303.5mW.
    - 3D GS: GSCore achieves 91.2 fps (versus only 6.4 fps on a Jetson Xavier GPU). GauSPU achieves a 63.9x improvement in energy efficiency compared to an RTX 3090 GPU.
  - Diffusion Model Accelerators:
    - Cambricon-D reduces memory access by more than 66% for diffusion models.
    - A CIM-based diffusion model accelerator achieves 74.34 TFLOPS/W.
- Hierarchical and End-to-End Models: Frameworks like Helix, GR00T N1, and GO-1 demonstrate impressive zero-shot generalization, instruction following, few-shot learning, and complex task handling, often significantly outperforming baselines in success rate (e.g., RDT-1B shows an average success rate improvement of 56%). OpenVLA also shows improved speed and success with efficient fine-tuning.
In summary, the paper's analysis of existing results strongly validates that algorithm-hardware co-design is not merely an optimization but a necessity for robotic computing systems, particularly as embodied AI models grow in complexity and demand real-time operation under tight power budgets. The diverse examples showcase how tailoring hardware to algorithmic characteristics (sparsity, parallelism, locality) and vice versa yields substantial improvements across all key performance metrics.
6.2. Data Presentation (Tables)
The original paper is a survey and does not contain any tables presenting its own experimental results. All data points are extracted from the text of the referenced papers and discussed qualitatively or quantitatively within the narrative.
6.3. Ablation Studies / Parameter Analysis
As a survey paper, the authors did not conduct their own ablation studies or parameter analysis. However, they discuss the impact of certain design choices or optimizations that are akin to findings from ablation studies presented in the referenced papers. For example:
- In the context of Transformer accelerators, Wang et al. investigated sparsity patterns and data redundancy to develop an approximate processing element, implying that their effectiveness was verified by comparing against non-optimized counterparts.
- For NeRF reconstruction, Han et al.'s MetaVRain system uses spatial attention, temporal familiarity, and top-down attention stages, each contributing to computation reduction, which would typically be verified through ablation studies in their original work.
- The Helix robot's implementation of hardware-aware training to mitigate temporal offset is an example of an optimization whose effectiveness would have been verified through comparison with a setup lacking this co-design approach.
The paper synthesizes these findings to support its overarching argument for algorithm-hardware co-design, but it does not perform these analyses itself.
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey provides a comprehensive overview of robotic computing systems through the lens of algorithm-hardware co-design, tracing the evolution from traditional robotics to hierarchical and end-to-end embodied AI models. The paper underscores that despite significant advancements in algorithms, achieving a balance among accuracy, latency, and power consumption remains a critical challenge, especially when relying on general-purpose hardware. The authors conclude that algorithm-hardware co-design is the primary methodology to address this performance-efficiency gap by leveraging inherent algorithmic properties like parallelism, locality, sparsity, and similarity for both algorithm optimization and hardware innovation. The survey details numerous examples of co-designed solutions across perception, planning, action mapping, control, and embodied AI components (Transformers, 3D reconstruction, diffusion models), demonstrating substantial system-wide benefits.
7.2. Limitations & Future Work
The paper explicitly identifies new challenges and many research opportunities for realizing smarter and faster robotic computing systems in the era of embodied AI. These challenges stem from the rapid evolution of embodied AI algorithms and the advancements in emerging hardware technologies.
Two main challenges are highlighted, leading to a proposed three-layer technology stack for future exploration (as illustrated in Figure 10 of the original paper):
The following figure (Figure 10 from the original paper) illustrates the co-design process of robotic computing systems and embodied AI algorithms with hardware:
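The image is a schematic showing the co-design flow of robotic computing systems and embodied AI algorithms with hardware, emphasizing the interaction between top-down algorithm adaptation and bottom-up exploitation of hardware features.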
- Adapting Computing Platforms to Rapidly Evolving Embodied AI Algorithms:
  - Challenge: The fast-changing algorithm architectures and dynamic configurations of sensors and actuators demand flexible computing platforms.
  - Opportunities (Top-Down - Software Toolchain & Hardware Architecture):
    - Software Toolchain:
      - Compilers: Need modular architectures to easily update new model structures and operators.
      - Operating Systems: Must provide generic software interfaces and autonomously manage data synchronization for sensors and actuators, simplifying algorithm transfer across configurations.
    - Hardware Architecture:
      - Reconfigurable Hardware: Must dynamically adjust settings (dataflow, precision) at runtime to meet varying performance demands (e.g., low latency for control, high energy efficiency for planning).
      - Heterogeneous Integration: Should seamlessly incorporate high-performance hardware tailored to different algorithms, enabling agile deployment for diverse applications.
- Transforming the Potential of Emerging Hardware Innovations into End-to-End Inference Improvements:
  - Challenge: Emerging technologies like 3D IC, Chiplet, and CIM introduce complex features and expand the design space, making performance evaluation and effective utilization difficult.
  - Opportunities (Bottom-Up - Hardware Toolchain & Algorithms):
    - Hardware Toolchain:
      - Simulation Tools: Need precise models for new hardware characteristics and efficient simulation capabilities for accurate and quick deployment evaluation.
      - Electronic Design Automation (EDA) Tools: Must explore the enlarged design space afforded by emerging technologies to fully exploit their benefits for next-generation embodied AI platforms.
    - Algorithms:
      - Embodied AI Algorithms: Should be developed to directly exploit unique computation features of cutting-edge hardware. For example, 3D IC's ultra-high bandwidth (bank-wise) could motivate new training algorithms that partition gradient computing and parameter updating into bank granularities with minimal cross-bank communication.
The paper emphasizes that embracing cross-stack co-design, where algorithm development directly considers hardware features and vice versa, will be significant for the development of next-generation robotic computing systems.
7.3. Personal Insights & Critique
This survey is an incredibly valuable resource for anyone working in or studying robotics, AI, and computer architecture. Its strength lies in its integrated perspective, which is often missing in more siloed research.
Personal Insights:
- Holistic View is Essential: The paper strongly reinforces that the era of designing algorithms or hardware in isolation is over for advanced robotics. Algorithm-hardware co-design is not a luxury but a fundamental necessity to overcome physical limitations and achieve practical embodied AI. This holistic perspective is crucial for students and researchers entering the field.
- Roadmap Clarity: The three-step roadmap from traditional robotics to hierarchical and end-to-end embodied AI provides an excellent mental model for understanding the progression of the field. It helps contextualize why certain algorithms and hardware challenges emerge at different stages.
- Sparsity, Locality, Parallelism, Similarity as Guiding Principles: The identification of these four common algorithmic properties as drivers for co-design is a powerful generalization. It offers a framework for analyzing any new algorithm for co-design opportunities.
- Emphasis on Emerging Hardware: The detailed discussion of CIM, 3D IC, and Chiplet technologies and their specific implications for embodied AI is forward-looking and highly relevant, providing concrete directions for hardware architects.
- The "Why" Behind the "What": The paper consistently explains why certain hardware solutions are developed (e.g., CIM for the Von Neumann bottleneck, FPGA for flexibility), which is very helpful for a beginner to grasp the motivation behind complex technical choices.
Critique:
- Density for Beginners: While "beginner-friendly" is a goal, the sheer volume of specific algorithm and hardware examples, especially in Sections 2 and 3, could still be overwhelming for a true novice. A more targeted introductory section, perhaps highlighting the most representative example for each category, followed by deeper dives, might improve accessibility.
- Lack of Quantitative Synthesis: While the paper mentions performance gains (e.g., "63.9x improvement in energy efficiency"), it doesn't aggregate these into comparative tables or unified benchmarks across different approaches. As a survey, this might be outside its scope, but it would have provided a clearer quantitative landscape of the state-of-the-art for the reader.
- Abstract Missing: The absence of an abstract in the provided text is a notable omission, as it usually serves as the primary entry point for readers to quickly grasp the paper's essence.
- Formula Omission (Expected for Survey, but Still a Point): As noted in the methodology, the paper does not present its own mathematical formulas. While this is typical for a survey, for certain foundational concepts like self-attention (which is central to Transformers), proactively including its formula in the prerequisite section would further enhance beginner comprehension, even if the paper itself only references it. (I have addressed this in my Prerequisite Knowledge section).
Transferability and Future Value:
The methods and conclusions of this paper are highly transferable. The algorithm-hardware co-design paradigm is applicable to almost any compute-intensive domain beyond robotics, such as edge AI, high-performance computing, and IoT devices, where power, latency, and throughput are critical constraints. The identified challenges in software toolchains, hardware architectures, and algorithm-level innovations for adapting to evolving AI models and hardware technologies are universally relevant to the future of computing. This paper serves as a foundational reference for researchers aiming to build efficient, intelligent, and autonomous systems in the coming decades.