Olaf: Bringing an Animated Character to Life in the Physical World

Published: 12/19/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper brings the animated character Olaf to the physical world, using reinforcement learning guided by animation references for control. It introduces a compact mechanical design with hidden asymmetric legs, together with reward-based strategies for impact-noise reduction and actuator temperature control, and validates the approach in simulation and on hardware.

Abstract

Animated characters often move in non-physical ways and have proportions that are far from a typical walking robot. This provides an ideal platform for innovation in both mechanical design and stylized motion control. In this paper, we bring Olaf to life in the physical world, relying on reinforcement learning guided by animation references for control. To create the illusion of Olaf's feet moving along his body, we hide two asymmetric legs under a soft foam skirt. To fit actuators inside the character, we use spherical and planar linkages in the arms, mouth, and eyes. Because the walk cycle results in harsh contact sounds, we introduce additional rewards that noticeably reduce impact noise. The large head, driven by small actuators in the character's slim neck, creates a risk of overheating, amplified by the costume. To keep actuators from overheating, we feed temperature values as additional inputs to policies, introducing new rewards to keep them within bounds. We validate the efficacy of our modeling in simulation and on hardware, demonstrating an unmatched level of believability for a costumed robotic character.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "Olaf: Bringing an Animated Character to Life in the Physical World".

1.2. Authors

The authors of the paper are:

  • David Müller

  • Espen Knoop

  • Dario Mylonopoulos

  • Agon Serifi

  • Michael A. Hopkins

  • Ruben Grandia

  • Moritz Bächer

    Affiliations are not explicitly stated in the provided text, but the nature of the work suggests a collaboration of robotics and computer animation researchers, likely from academic institutions or research labs specializing in these areas.

1.3. Journal/Conference

The paper is listed as published in Robotics: Science and Systems XX (RSS 2024), indicating presentation at the Robotics: Science and Systems (RSS) conference. RSS is a highly reputable and selective international conference in the field of robotics, known for presenting cutting-edge research across perception, manipulation, control, and learning. Publication at RSS signifies a significant, high-quality contribution within the robotics research community.

1.4. Publication Year

The RSS 2024 designation indicates a 2024 conference publication, although some references are dated 2025 and the provided arXiv timestamp is 2025-12-18. Since RSS publishes papers in the year the conference is held, 2024 is taken as the publication year for the conference venue, with the later date reflecting the preprint posting.

1.5. Abstract

The paper addresses the challenge of creating a physical robotic embodiment of an animated character, specifically Olaf, which often possesses non-physical movements and proportions atypical for traditional robots. This project serves as a platform for innovation in both mechanical design and stylized motion control. The authors achieve this by employing reinforcement learning (RL) guided by animation references for control.

Key innovations include:

  • A novel mechanical design featuring two asymmetric legs hidden under a soft foam skirt to create the illusion of Olaf's feet moving along his body.

  • The use of spherical and planar linkages to fit actuators within tight spatial constraints for the arms, mouth, and eyes, maintaining the character's slim appearance.

  • The incorporation of additional rewards in the RL framework to significantly reduce harsh foot contact sounds, enhancing believability.

  • A thermal-aware policy that feeds actuator temperature values as inputs and uses new rewards to keep temperatures within safe bounds, addressing overheating risks caused by the large head and slim neck design, exacerbated by the costume.

    The efficacy of these solutions is validated through simulation and hardware experiments, demonstrating an unprecedented level of believability for a costumed robotic character.

The original source link provided is https://arxiv.org/abs/2512.16705. The PDF link is https://arxiv.org/pdf/2512.16705v1.pdf. This indicates the paper is a preprint available on arXiv, a common repository for research papers prior to, or in conjunction with, formal publication. The publication status is preprint, with the specified publication at RSS 2024.

2. Executive Summary

2.1. Background & Motivation

The field of legged robotics has predominantly focused on functionality, robustness, and efficiency, leading to impressive achievements in navigating complex terrain and performing dynamic maneuvers. However, as robots increasingly interact with humans in domains like entertainment and companionship, the metrics of success expand beyond mere functional performance to include believability and character fidelity. Animated characters, with their often non-physical movements and unconventional proportions, present a significant challenge and an ideal testbed for innovation in this new paradigm.

The core problem the paper aims to solve is bringing a specific animated character, Olaf from Disney's Frozen, to life as a physical robot while preserving its unique visual appearance and movement style. This is challenging because Olaf's design—a large, heavy head, small snowball feet, and a slim neck—is far from typical robotic morphologies and poses significant mechanical and control constraints. For instance, the illusion of free-floating feet under his body requires novel leg mechanisms, and the disproportionate head mass coupled with small actuators in the neck creates a high risk of overheating. Furthermore, the believability of the character is fragile; harsh footstep impacts or jitter can easily break the illusion. Prior research has explored character robotics, but often with new characters or existing robotic characters with more favorable proportions. Bringing an existing, non-robotic, costumed character with less favorable proportions to life demands navigating complex tradeoffs between functionality and believability within a tight design envelope.

The paper's entry point and innovative idea lie in tackling these challenges through a synergistic approach combining compact mechatronic design with Reinforcement Learning (RL)-based control guided by animation references. This approach allows for the faithful reproduction of the character's stylized motions while addressing practical constraints like heat dissipation and impact noise.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of robotics and character animation:

  • Mechatronic Design for Character Fidelity: The development of a compact, scale-accurate robotic design for Olaf. This includes a novel asymmetric six-degrees-of-freedom (6-DoF) leg mechanism ingeniously hidden beneath a soft foam skirt to emulate Olaf's characteristic snowball feet. It also features the integration of remotely actuated spherical, planar, and spatial linkages for the arms, mouth, and eyes, which are crucial for achieving high-fidelity, expressive motion within the character's tight physical constraints.

  • Thermal-aware Policy: The introduction of a control policy that incorporates actuator temperature as an input and is trained to prevent overheating through a novel reward formulation based on Control Barrier Functions (CBFs). This is particularly critical for characters with disproportionate weight distributions and restricted actuator space.

  • Impact Reduction Reward: A reward term within the RL framework designed to substantially reduce footstep noise. This contributes significantly to preserving the character's believability by making its movements quieter and more natural, aligning with the illusion of a soft, animated character.

    The key conclusions and findings are that this integrated approach allows for the creation of a freely walking robot that accurately imitates the animated character in terms of style and appearance. The innovations effectively address the unique mechanical and control challenges posed by such a character, demonstrating an unmatched level of believability for a costumed robotic character. Specifically, the thermal reward successfully mitigates overheating while maintaining tracking accuracy, and the impact reduction reward noticeably lowers footstep noise without compromising the characteristic gait. These findings solve the problem of translating stylized, non-physical animated movements into a robust and believable physical robotic form.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader would benefit from knowledge in several key areas:

  • Reinforcement Learning (RL): RL is a paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns a policy (a mapping from states to actions) through trial and error.
    • Agent: The decision-maker (in this paper, Olaf's control system).
    • Environment: The world the agent interacts with (the physical robot and its surroundings, or its simulation model).
    • State ($\boldsymbol{s}_t$): A description of the current situation of the agent and environment at time $t$.
    • Action ($\boldsymbol{a}_t$): A decision made by the agent at time $t$ that affects the environment.
    • Reward ($r_t$): A scalar feedback signal from the environment indicating how good or bad the agent's last action was. The goal is to maximize the total cumulative reward.
    • Policy ($\pi$): A function that maps observed states to actions. In this paper, $\pi(\boldsymbol{a}_t \mid \boldsymbol{s}_t, \boldsymbol{g}_t)$ indicates that the action $\boldsymbol{a}_t$ is conditioned on the state $\boldsymbol{s}_t$ and a control input $\boldsymbol{g}_t$.
    • Proximal Policy Optimization (PPO): A popular RL algorithm used for training the policies in this paper. It is an on-policy algorithm that strikes a balance between ease of implementation, sample efficiency, and good performance.
  • Robotics Kinematics and Dynamics:
    • Degrees of Freedom (DoF): The number of independent parameters that define the configuration of a mechanical system. For example, a robotic arm might have 6 DoF to control its position and orientation in 3D space. Olaf has 25 DoF in total.
    • Actuators: Devices that convert energy (typically electrical) into mechanical force or motion (e.g., motors). In robotics, they are responsible for moving the robot's joints.
    • Linkages: Mechanical assemblies that transmit forces and motion. They are used here to remotely actuate joints, allowing actuators to be placed where space is available, away from the joint itself. Spherical linkages allow for rotational motion, while planar linkages restrict motion to a single plane.
    • Inverse Kinematics (IK): The process of calculating the joint parameters required to achieve a desired position and orientation of the end-effector (e.g., hand, foot) of a kinematic chain. Forward kinematics is the opposite, calculating the end-effector pose from joint parameters.
  • Control Systems:
    • Proportional-Derivative (PD) Controller: A feedback control loop mechanism widely used in industrial control systems. It calculates an error value as the difference between a desired setpoint and a measured process variable; the proportional term responds to the current error, and the derivative term responds to the rate of change of the error, helping to damp oscillations and improve stability. In this paper, PD controllers are used at the joints with position targets provided by the RL policy (a minimal code sketch follows this list).
    • Control Barrier Functions (CBFs): A method for ensuring safety in control systems. A CBF is a function that defines a "safe" region in the state space. By ensuring that the derivative of the CBF remains non-negative (or some other condition), the controller can guarantee that the system will not exit the safe region. This is used in the paper to prevent overheating and joint limit violations.
  • Character Animation:
    • Animation References: Pre-designed motions created by human animators using animation software. These provide the stylistic target for the robot's movements.
    • Gait Generation: The process of designing and creating walking or locomotion patterns. Stylized walk cycles refer to unique, character-specific walking patterns, like Olaf's heel-toe motion.
    • Path Frame: A moving coordinate system often used in character control to make movements invariant to the character's global pose and facilitate smooth transitions. It defines a local reference for the character's motion along a path.
  • Thermal Dynamics: The study of how heat is generated, stored, and transferred in a system. In this context, it refers to how heat generated by actuators (due to Joule heating from current flow) leads to temperature changes, and how heat is dissipated to the environment. A first-order system is a common simple model for such dynamics.
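
To make the joint-level PD control described above concrete, the following is a minimal, self-contained sketch of a position-target PD loop on a toy one-joint system. The gains, inertia, and time step are illustrative assumptions, not values from the paper.

```python
import numpy as np

def pd_torque(q, q_dot, q_target, kp=30.0, kd=0.5):
    """Joint-space PD law: torque from position error plus velocity damping.

    q, q_dot : current joint position (rad) and velocity (rad/s)
    q_target : position target, e.g. produced by an RL policy
    kp, kd   : illustrative gains, not values from the paper
    """
    return kp * (q_target - q) - kd * q_dot

# Drive a single joint toward 0.3 rad from rest on a toy 1-DoF model.
q, q_dot, dt, inertia = 0.0, 0.0, 0.002, 0.05
for _ in range(500):
    tau = pd_torque(q, q_dot, q_target=0.3)
    q_dot += (tau / inertia) * dt  # integrate toy rigid-body dynamics
    q += q_dot * dt
print(f"final joint position: {q:.3f} rad")
```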

3.2. Previous Works

The paper contextualizes its work by contrasting it with traditional robotics and highlighting developments in character robotics and RL-based control.

  • Traditional Legged Robots: Most legged robots (e.g., anthropomorphic robots like ASIMO [10] or Albert Hubo [11], or zoomorphic robots like MIT Cheetah [15] or ANYmal [16]) are inspired by biology and optimized for functionality, robustness, and efficiency [18]. Their designs often place actuators at the joints [10], [16] or use remote actuation through linkages [19], [17], [20]. This paper departs from this by prioritizing artistic fidelity over pure functional performance.
  • Character Robotics:
    • Some prior work, like the Cosmo robot [6], created life-like, torque-controlled humanoids for entertainment, representing existing characters. However, Olaf is a non-robotic, costumed character with less favorable proportions, making it a distinct challenge.
    • Other efforts focus on creating new robotic characters and pipelines for animating them [7], or playful robots for research and entertainment like Keepon [8] or Aibo [5]. This paper, however, focuses on bringing an existing animated character to life, which adds strict constraints on maintaining believability within a tight design envelope.
    • The path frame concept used in this paper for RL control is built upon previous work in character control [7], which developed a design and control of a bipedal robotic character. This concept helps with global pose invariance and smooth policy transitions.
  • Reinforcement Learning for Locomotion: RL has seen significant progress in robust locomotion [23], [24], [25], [26], imitation learning [27], [28], [29], [30], and navigation [31], [32].
    • Imitation learning [27] uses RL to learn skills by mimicking example demonstrations, often from human motion capture data or animation references. The DeepMimic paper [27] by Peng et al. is a foundational work in this area, demonstrating how RL can achieve physics-based character skills guided by examples.
    • Recent RL research has also started incorporating real-world effects, such as energy losses [33] and impact-minimizing strategies for quieter locomotion [34], [35]. The paper builds on these ideas for impact reduction and introduces thermal-aware policies.
    • Control Barrier Functions (CBFs) [37] are used as a safety-critical control method, which has applications in various robotics contexts to ensure constraints are met.

3.3. Technological Evolution

The evolution of robotics has seen a gradual shift. Initially, the focus was heavily on industrial applications, emphasizing precision, speed, and repetitive tasks in structured environments. This then expanded to mobile robotics, where functionality, robustness, and efficiency in unstructured environments became paramount, leading to agile legged robots capable of traversing difficult terrain.

More recently, as robots move into direct human interaction roles (e.g., entertainment, companionship, service), the emphasis has broadened to include human-centric qualities like believability, expressiveness, and character fidelity. This requires robots to not just perform tasks, but to do so in a way that is engaging and visually appealing, often mimicking the stylized movements and appearances of animated characters.

This paper's work fits squarely within this latter phase of technological evolution. It leverages advanced RL techniques, originally developed for complex dynamic locomotion, and combines them with novel mechatronic design principles to meet aesthetic and character-specific constraints, rather than purely functional ones. The integration of thermal awareness and impact reduction in RL policies further signifies a move towards robots that are not only capable but also perceive and respond to real-world physical nuances relevant to human interaction.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • Character-First Design Philosophy: Unlike most legged robots that prioritize functional requirements (e.g., speed, efficiency, terrain traversal), Olaf's design is driven by artistic reference and character fidelity. This leads to unique mechanical challenges (e.g., large heavy head, slim neck, snowball feet, hidden legs) that necessitate novel solutions.
  • Novel Asymmetric Leg Mechanism: While linkages are common for remote actuation in robotics, the paper introduces a novel asymmetric 6-DoF leg design specifically to fit within Olaf's compact body and conceal the legs under a skirt, maximizing workspace despite severe spatial constraints. Traditional legged robots typically use symmetric leg designs.
  • Integrated RL with Specific Real-World Constraints: The Reinforcement Learning approach is not just for general locomotion or imitation learning. It is explicitly tailored to address two critical, real-world, character-specific constraints:
    1. Thermal Management: The thermal-aware policy that incorporates actuator temperature as an input and uses CBF-based rewards to prevent overheating is a significant innovation. Most RL policies do not directly consider real-time actuator temperatures as part of their observation space or reward function, especially not in a CBF-constrained manner.
    2. Impact Noise Reduction: The impact reduction reward term specifically targets footstep noise, a factor crucial for believability in a costumed character. While some RL work has explored quiet locomotion, integrating it directly into the reward function to preserve stylized gaits is a refined application for character robotics.
  • Fusion of RL and Classical Control for Expressiveness: The paper elegantly separates articulated backbone control (using RL for dynamic tasks) from show function control (using classical methods like polynomial fitting and PD loops for expressive elements like eyes, mouth, arms). This hybrid approach optimizes control complexity and leverages the strengths of each method for different aspects of character behavior.
  • Unfavorable Proportions: The paper directly tackles a non-robotic, costumed character with less favorable proportions compared to existing robotic characters like Cosmo [6], which already has a more humanoid form. This pushes the boundaries of what can be embodied physically while maintaining character integrity.

4. Methodology

The methodology employed in bringing Olaf to life involves a tightly integrated approach combining mechatronic design with Reinforcement Learning (RL) for control, complemented by classical control for expressive features. The core idea is to achieve a scale-accurate and believable robotic representation of the animated character, despite its non-physical movements and atypical proportions.

4.1. Principles

The core idea is a character-driven design and control paradigm. Instead of starting with a generic robot platform and adapting it, the process begins with the animated character's specifications (visual appearance, movement style, proportions) and works backward to design the mechanics and control system. This involves:

  1. Mechanical Concealment: Designing mechanisms that are both compact and effective but remain hidden beneath the character's costume, preserving its aesthetic.

  2. Stylized Motion Reproduction: Using Reinforcement Learning to learn policies that accurately imitate animation references while maintaining dynamic balance and robustness.

  3. Physical Constraint Integration: Incorporating real-world physical constraints, such as actuator overheating and footstep noise, directly into the RL reward function and observation space to enhance believability and system longevity.

  4. Separation of Concerns: Dividing control into an RL-driven articulation backbone for dynamic locomotion and classical control for low-inertia expressive show functions.

    The theoretical basis behind RL is to learn complex control policies through interaction with an environment, maximizing a reward signal. Imitation learning specifically biases this process towards replicating desired behaviors from expert demonstrations (here, animation references). Control Barrier Functions provide a formal framework for incorporating safety constraints (thermal limits, joint limits) into the control design, ensuring that the system remains within safe operating regions.

4.2. Core Methodology In-depth (Layer by Layer)

The workflow is iterative, starting with the mechanical backbone and then adding expressive functions.

4.2.1. Workflow Overview

The process began with the main backbone (legs and neck), shown in green in Figure 2. For animation, an animation rig and animation references were maintained with matching degrees of freedom (DoF). Policies for standing and walking were iteratively trained and evaluated in simulation to explore optimal actuated DoF placement and expressiveness. In a second phase, mechanical show functions (arms, mouth, eyes, eyebrows), shown in blue, were added. These show functions are designed to drive expressive behavior without significantly affecting system dynamics.

The control system is layered:

  • The articulation backbone is controlled by RL policies.

  • The show functions use classical control methods.

    The RL simulation model incorporates the mechanical design and actuator temperature dynamics. Policies are trained with a reward function that includes imitation terms for tracking kinematic references and penalties for physical limits (joint ranges, actuator temperatures). Separate walking and standing policies are trained, each conditioned on control inputs ($\boldsymbol{g}_t$) for animation tracking and interactive control.

At runtime, Olaf is puppeteered via a remote interface. An Animation Engine processes commands, switches policies, triggers animations and audio, and provides joystick control.

The following figure (Figure 2 from the original paper) shows the overall workflow:

Fig. 2: Mechatronic Design and RL-based Control. We separate the articulated backbone from the show functions. The backbone is controlled via policies conditioned on the high-level control input $\boldsymbol{g}_t$ and trained using a combination of imitation, overheating, and impact rewards. During training, the control inputs are randomized, whereas at runtime, the Animation Engine generates control inputs from puppeteering commands.

4.2.2. Mechatronic Design

The Olaf robot stands 88.7 cm tall (without hair) and weighs 14.9 kg. It has a total of 25 degrees of freedom (DoF): 6 per leg, 2 per shoulder, 3 in the neck, 1 in the jaw, 1 in the eyebrow, and 4 in the mechanical eyes. It uses Unitree and Dynamixel actuators, with three on-board computers for computation.

The following figure (Figure 3 from the original paper) shows an annotated cutaway view of the robot, illustrating the internal mechanisms:

Fig. 3: Mechatronic Design. Shells and skirt have been cut away to show the interior. Note that the costume is not shown.

Compact Design Envelope

Olaf's animated character has free-floating snowball feet with no visible legs. To emulate this, the robot's design conceals the legs within the lower body (the "snowball" section), limiting their motion envelope.

  • Asymmetric 6-DoF Leg Design: A novel design is implemented where one leg is inverted.
    • The left leg has a rear-facing hip roll actuator and a forward knee.
    • The right leg has a forward hip roll actuator and a rear-facing knee. This asymmetric configuration helps mitigate collisions between the two hip roll actuators and between the knees when the legs rotate in yaw. This design also reduces the part count because both legs are identical, not mirrored.
  • Remote Actuation for Shoulders: The limited space prevents placing actuators directly at the 2-DoF shoulder joints. Instead, actuators are placed within the torso, and the shoulder motion is driven through a spherical 5-bar linkage.
  • Mouth Mechanism: A single actuator drives both the upper and lower jaw. The lower jaw is actuated directly, while the upper jaw is coupled via a 4-bar linkage.
  • Mechanical Eyes: The eyes have independent direct-drive eye yaw. Eye pitch and eyelid movements are remotely actuated through 4-bar linkages. All other joints are direct-drive.

Soft Shells

The lower snowball (concealing the legs) is a flexible skirt made from polyurethane (PU) foam. This material provides sufficient structure to maintain shape while allowing deflection for larger leg movements (e.g., during recovery steps), preventing motion restriction. The foot snowballs are constructed similarly. The flexible foam also absorbs impacts, reducing damage during falls.

Costuming and Appendages

  • The costume is made of 4-way stretch fabric to conform to the robot and its movements.
  • A semi-rigid "boning" structure maintains the costume's shape around the mouth cavity.
  • Snap fasteners and magnets secure the costume around the eyes and mouth.
  • Arms, nose, buttons, eyebrows, and hair are attached with magnets, allowing them to affix atop the costume and break away during falls or impacts to mitigate damage.

4.2.3. Reinforcement Learning

Building on previous work [7], separate walking and standing policies are used, each an independent RL problem tailored to its motion regime.

At each time step $t$, the agent produces an action $\boldsymbol{a}_t$ according to a policy $\pi$:
$$\pi(\boldsymbol{a}_t \mid \boldsymbol{s}_t, \boldsymbol{g}_t)$$
where:

  • $\boldsymbol{a}_t$ is the action at time $t$.

  • $\boldsymbol{s}_t$ is the observable state of the robot at time $t$.

  • $\boldsymbol{g}_t$ is the control input at time $t$.

    The environment then returns the next state $\boldsymbol{s}_{t+1}$ and a scalar reward $r_t$, which encourages accurate imitation of artist-defined kinematic motions while maintaining dynamic and robust balance.

Path Frame Concept

To achieve invariance to the robot's global pose and smooth transitions between policies, the path frame concept [7] is used. The path frame state at time $t$ is:
$$\boldsymbol{p}_t^{\mathrm{PF}} := \big( x_t^{\mathrm{PF}}, y_t^{\mathrm{PF}}, \psi_t^{\mathrm{PF}} \big)$$
where:

  • $(x_t^{\mathrm{PF}}, y_t^{\mathrm{PF}})$ denotes the horizontal position.

  • $\psi_t^{\mathrm{PF}}$ denotes the yaw orientation.

    During walking, the path frame advances by integrating the commanded path velocity $\boldsymbol{v}_t^{\mathrm{PF}}$ (Figure 4). During standing, the frame slowly converges toward the midpoint between the feet. Quantities relative to this frame are denoted with the superscript $\mathcal{P}$. The path frame is constrained to remain within a bounded distance of the torso to prevent excessive deviation.

The following figure (Figure 4 from the original paper) visualizes the path frame concept:

Fig. 4: Path Frame. Visualization of the path-frame concept from [7] and the robot's center of mass.
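
To make the path-frame update concrete, here is a minimal sketch of how such a frame could be advanced during walking and relaxed toward the feet midpoint during standing. The explicit Euler integration, function names, and the first-order relaxation are illustrative assumptions; the paper describes this behavior only qualitatively.

```python
import numpy as np

def advance_path_frame(pf, v_cmd, yaw_rate_cmd, dt):
    """Advance the path frame (x, y, yaw) by integrating a commanded planar
    velocity expressed in the path frame (walking mode)."""
    x, y, yaw = pf
    vx, vy = v_cmd
    x += (np.cos(yaw) * vx - np.sin(yaw) * vy) * dt
    y += (np.sin(yaw) * vx + np.cos(yaw) * vy) * dt
    yaw += yaw_rate_cmd * dt
    return np.array([x, y, yaw])

def relax_to_feet_midpoint(pf, left_foot_xy, right_foot_xy, rate, dt):
    """During standing, let the frame position converge toward the midpoint
    between the feet (assumed first-order low-pass behavior)."""
    midpoint = 0.5 * (np.asarray(left_foot_xy) + np.asarray(right_foot_xy))
    pf = np.array(pf, dtype=float)
    pf[:2] += rate * dt * (midpoint - pf[:2])
    return pf
```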

Animation Reference

Animation references for walking and standing are created by artists. These are conditioned on a control input $\boldsymbol{g}_t$. A gait generation tool [36] is used to design stylized walk cycles with heel-toe motion to capture Olaf's characteristic gait.

Based on these references, the full kinematic target state $\boldsymbol{x}_t$ is obtained through a generator function $f(\cdot)$ that maps the path-frame state $\boldsymbol{p}_t^{\mathrm{PF}}$ and the policy-dependent control input $\boldsymbol{g}_t$ to the kinematic target using interpolation and path-frame alignment. This mapping is expressed as:
$$
\begin{aligned}
\boldsymbol{x}_t &:= \big( \boldsymbol{p}_t^{\mathcal{P}}, \boldsymbol{\theta}_t^{\mathcal{P}}, \boldsymbol{v}_t^{\mathcal{R}}, \boldsymbol{\omega}_t^{\mathcal{R}}, \boldsymbol{q}_t, \dot{\boldsymbol{q}}_t, c_t^{L}, c_t^{R} \big), \\
\boldsymbol{g}_t &:= \begin{cases} \big( \hat{\boldsymbol{q}}_t^{\mathrm{neck}}, \hat{\boldsymbol{\theta}}_t, \hat{p}_{z,t} \big) & \text{standing}, \\ \big( \hat{\boldsymbol{q}}_t^{\mathrm{neck}}, \hat{\boldsymbol{v}}_t^{\mathcal{P}} \big) & \text{walking}, \end{cases} \\
\boldsymbol{x}_t &= \begin{cases} f\big( \boldsymbol{p}_t^{\mathrm{PF}}, \boldsymbol{g}_t \big) & \text{standing}, \\ f\big( \boldsymbol{p}_t^{\mathrm{PF}}, \boldsymbol{g}_t, \phi_t \big) & \text{walking}, \end{cases}
\end{aligned}
$$
where:

  • $\boldsymbol{x}_t$ represents the full kinematic state of the robot, defined by:
    • $\boldsymbol{p}_t^{\mathcal{P}}$: Torso position relative to the path frame.
    • $\boldsymbol{\theta}_t^{\mathcal{P}}$: Torso orientation (unit quaternion) relative to the path frame.
    • $\boldsymbol{v}_t^{\mathcal{R}}$: Linear torso velocity in the robot's root frame.
    • $\boldsymbol{\omega}_t^{\mathcal{R}}$: Angular torso velocity in the robot's root frame.
    • $\boldsymbol{q}_t$: Joint positions.
    • $\dot{\boldsymbol{q}}_t$: Joint velocities.
    • $c_t^{L}$ and $c_t^{R}$: Left and right foot contact indicators.
  • $\boldsymbol{g}_t$ is the control input to the generator function, which varies for standing and walking:
    • For standing: the target neck joint positions ($\hat{\boldsymbol{q}}_t^{\mathrm{neck}}$), target torso orientation ($\hat{\boldsymbol{\theta}}_t$), and target torso height ($\hat{p}_{z,t}$).
    • For walking: the target neck joint positions ($\hat{\boldsymbol{q}}_t^{\mathrm{neck}}$) and target path velocity ($\hat{\boldsymbol{v}}_t^{\mathcal{P}}$).
  • Hats $\hat{\cdot}$ denote target quantities.
  • For walking, $f(\cdot)$ additionally takes the gait phase variable $\phi_t$.
  • The control input $\boldsymbol{g}_t$ is randomized across its full range during training to ensure robustness and broad applicability.

Policy State

Actions $\boldsymbol{a}_t$ are position targets for Proportional-Derivative (PD) controllers at the joints. The robot's proprioceptive state $\boldsymbol{s}_t$, observed by the RL agent, is:
$$\boldsymbol{s}_t := \big( \boldsymbol{p}_t^{\mathcal{P}}, \boldsymbol{\theta}_t^{\mathcal{P}}, \boldsymbol{v}_t^{\mathcal{R}}, \boldsymbol{\omega}_t^{\mathcal{R}}, \boldsymbol{q}_t, \dot{\boldsymbol{q}}_t, \boldsymbol{a}_{t-1}, \boldsymbol{a}_{t-2}, \boldsymbol{T}_t, \phi_t \big)$$
where:

  • $\boldsymbol{p}_t^{\mathcal{P}}$ and $\boldsymbol{\theta}_t^{\mathcal{P}}$ form the root pose relative to the path frame.
  • $\boldsymbol{v}_t^{\mathcal{R}}$ and $\boldsymbol{\omega}_t^{\mathcal{R}}$ denote the torso velocities expressed in the root frame.
  • $\boldsymbol{q}_t$ and $\dot{\boldsymbol{q}}_t$ are the joint positions and velocities.
  • $\boldsymbol{a}_{t-1}$ and $\boldsymbol{a}_{t-2}$ are the actions of the two previous time steps.
  • $\boldsymbol{T}_t$ is the vector of actuator temperatures.
  • For walking, the policy is additionally conditioned on the gait phase variable $\phi_t$.
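
As an illustration, the proprioceptive state above could be flattened into a single observation vector as sketched below. The ordering, dtype, and the sine/cosine encoding of the gait phase are assumptions for illustration; the paper does not specify the exact layout.

```python
import numpy as np

def build_observation(p_P, theta_P, v_R, omega_R, q, q_dot,
                      a_prev, a_prev2, temps, phase=None):
    """Concatenate the proprioceptive quantities of s_t into one flat vector."""
    parts = [p_P, theta_P, v_R, omega_R, q, q_dot, a_prev, a_prev2, temps]
    if phase is not None:                              # walking policy only
        parts.append([np.sin(phase), np.cos(phase)])   # assumed phase encoding
    return np.concatenate([np.atleast_1d(np.asarray(p, dtype=np.float32))
                           for p in parts])
```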

Reward Formulation

The reward $r_t$ is a sum of four components: imitation, regularization, limits, and impact reduction:
$$r_t = r_t^{\mathrm{imitation}} + r_t^{\mathrm{regularization}} + r_t^{\mathrm{limits}} + r_t^{\mathrm{impact\ reduction}}$$

  • Imitation and Regularization terms: Follow standard practice [27], [7], encouraging accurate tracking of reference motion with action penalties.

  • Limits terms: Capture constraints from Olaf's compact mechanical design.

  • Impact reduction term: Reduces foot impacts to lower footstep noise.

    The following are the reward terms from Table I of the original paper. Hats $\hat{\cdot}$ denote reference quantities, $\boxminus$ is the $SO(3)$ log-map orientation difference, and $\mathbb{1}[\cdot]$ is the indicator function.

| Name | Reward Term |
| --- | --- |
| Imitation | |
| Torso position xy | $\exp(-200.0 \cdot \|\boldsymbol{p}_{x,y} - \hat{\boldsymbol{p}}_{x,y}\|^2)$ |
| Torso orientation | $\exp(-20.0 \cdot \|\boldsymbol{\theta} \boxminus \hat{\boldsymbol{\theta}}\|^2)$ |
| Linear vel. xy | $\exp(-8.0 \cdot \|\boldsymbol{v}_{x,y} - \hat{\boldsymbol{v}}_{x,y}\|^2)$ |
| Linear vel. z | $\exp(-8.0 \cdot \|v_z - \hat{v}_z\|^2)$ |
| Angular vel. xy | $\exp(-2.0 \cdot \|\boldsymbol{\omega}_{x,y} - \hat{\boldsymbol{\omega}}_{x,y}\|^2)$ |
| Angular vel. z | $\exp(-2.0 \cdot \|\omega_z - \hat{\omega}_z\|^2)$ |
| Leg joint pos. | $-\|\boldsymbol{q}_l - \hat{\boldsymbol{q}}_l\|^2$ |
| Neck joint pos. | $-\|\boldsymbol{q}_n - \hat{\boldsymbol{q}}_n\|^2$ |
| Leg joint vel. | $-\|\dot{\boldsymbol{q}}_l - \dot{\hat{\boldsymbol{q}}}_l\|^2$ |
| Neck joint vel. | $-\|\dot{\boldsymbol{q}}_n - \dot{\hat{\boldsymbol{q}}}_n\|^2$ |
| Contact | $\sum_{i \in \{L,R\}} \mathbb{1}[c_i = \hat{c}_i]$ |
| Survival | $1.0$ |
| Regularization | |
| Joint torques | $-\|\boldsymbol{\tau}\|^2$ |
| Joint acc. | $-\|\ddot{\boldsymbol{q}}\|^2$ |
| Leg action rate | $-\|\boldsymbol{a}_l - \boldsymbol{a}_{t-1,l}\|^2$ |
| Neck action rate | $-\|\boldsymbol{a}_n - \boldsymbol{a}_{t-1,n}\|^2$ |
| Leg action acc. | $-\|\boldsymbol{a}_l - 2\boldsymbol{a}_{t-1,l} + \boldsymbol{a}_{t-2,l}\|^2$ |
| Neck action acc. | $-\|\boldsymbol{a}_n - 2\boldsymbol{a}_{t-1,n} + \boldsymbol{a}_{t-2,n}\|^2$ |
| Limits | |
| Neck temperature | $-\|\min(-\dot{\boldsymbol{T}}_n + \gamma_T (T_{\max} - \boldsymbol{T}_n), 0)\|_1$ |
| Joint limits (lower) | $-\|\min(\boldsymbol{q} - (\boldsymbol{q}_{\min} + \boldsymbol{q}_m), 0)\|_1$ |
| Joint limits (upper) | $-\|\min((\boldsymbol{q}_{\max} - \boldsymbol{q}_m) - \boldsymbol{q}, 0)\|_1$ |
| Foot-foot collision | $-\mathbb{1}[\mathrm{contact}(L, R)]$ |
| Impact Reduction | |
| Sound suppression | $-\sum_{i \in \{L,R\}} \min(\Delta v^2_{z,i}, \Delta v^2_{\max})$ |

Explanation of Reward Terms:

  • Imitation Rewards: These terms penalize deviations from the target kinematic reference for various aspects of the robot's state:
    • Torso position xy, Torso orientation, Linear vel. xy, Linear vel. z, Angular vel. xy, Angular vel. z: These use exponential reward kernels, so the reward is highest for perfect tracking and falls off rapidly as deviations grow. The $SO(3)$ log-map orientation difference $\boxminus$ is used for orientations, a standard way to measure orientation error in 3D.
    • Leg joint pos., Neck joint pos., Leg joint vel., Neck joint vel.: These use squared error penalties, directly penalizing differences from target joint positions and velocities. Weights differ between neck and legs due to different reflected inertias.
    • Contact: Rewards each foot whose actual contact state ($c_i$) matches the reference contact state ($\hat{c}_i$).
    • Survival: A constant positive reward for each timestep the agent survives without falling or violating critical conditions.
  • Regularization Rewards: These terms encourage smooth and efficient actions:
    • Joint torques: Penalizes large joint torques to reduce energy consumption and stress on actuators.
    • Joint acc.: Penalizes large joint accelerations to promote smoother motions.
    • Leg action rate, Neck action rate, Leg action acc., Neck action acc.: Penalize the rate of change and acceleration of actions (position targets for PD controllers), which helps reduce jitter and makes motor commands smoother.
  • Limits Rewards: These are critical for safety and mechanical integrity:
    • Neck temperature: A penalty based on Control Barrier Functions (CBF) to prevent actuator overheating. This term is explained in detail in the next section.
    • Joint limits (lower/upper): Penalties also based on CBF to prevent the joints from exceeding their physical range of motion. This is also detailed below.
    • Foot-Foot collision: A binary penalty if the left and right snowball feet make contact with each other, preventing self-collisions.
  • Impact Reduction Reward:
    • Sound suppression: This term penalizes the squared change in vertical foot velocity ($\Delta v^2_{z,i}$) between simulation steps, encouraging smoother landings and reducing impact noise. The term is saturated at $\Delta v^2_{\max}$ to prevent large values from destabilizing critic learning.

      Early termination is applied if the head, torso, upper legs, or arms contact the ground.
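
The sound suppression term above can be sketched as follows; the saturation threshold and the per-foot bookkeeping are illustrative assumptions.

```python
import numpy as np

def sound_suppression_reward(v_z_feet_prev, v_z_feet, dv_max=1.0):
    """Impact-reduction term: penalize the squared change in vertical foot
    velocity between simulation steps, saturated at dv_max**2 per foot.
    dv_max is an illustrative value, not reported in this analysis."""
    dv_sq = (np.asarray(v_z_feet) - np.asarray(v_z_feet_prev)) ** 2
    return -float(np.sum(np.minimum(dv_sq, dv_max ** 2)))

# A hard stop of the left foot is penalized more than a gentle one.
print(sound_suppression_reward([-0.8, 0.0], [0.0, 0.0]))  # -0.64
print(sound_suppression_reward([-0.2, 0.0], [0.0, 0.0]))  # -0.04
```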

Thermal Modeling

Olaf's slim, costume-covered neck necessitates small actuators supporting a heavy head, leading to frequent overheating. To address this, the actuator temperature $T$ must stay below a maximum temperature $T_{\max}$. This is formalized as an inequality constraint (5a):
$$h_T(T) = T_{\max} - T \geq 0$$
where:

  • $h_T(T)$ is the Control Barrier Function for temperature.

  • $T_{\max}$ is the maximum allowable temperature.

  • $T$ is the current actuator temperature.

    This constraint, which involves the slowly varying temperature state, is transformed into a Control Barrier Function (CBF) condition (5c), derived from (5b) $\dot{h}_T(T) + \gamma_T h_T(T) \geq 0$:
$$-\dot{T} + \gamma_T (T_{\max} - T) \geq 0$$
where:

  • $\dot{T}$ is the time derivative (rate of change) of the actuator temperature.

  • $\gamma_T > 0$ is a positive constant that determines how strongly the system is driven back into the safe set. Intuitively, as the temperature $T$ approaches $T_{\max}$, the condition forces the rate of change $\dot{T}$ toward zero or negative values, preventing overheating.

The CBF constraint for each actuator is translated into a penalty term by measuring the total violation, as defined in the Neck temperature row of Table I: $-\|\min(-\dot{\boldsymbol{T}}_n + \gamma_T (T_{\max} - \boldsymbol{T}_n), 0)\|_1$. The penalty is proportional to the magnitude of the violation: whenever the condition $-\dot{T} + \gamma_T (T_{\max} - T) \geq 0$ is not met, a penalty is incurred.

To implement this thermal CBF in simulation, a model of the actuator thermal dynamics is required. These dynamics are primarily driven by electrical Joule heating, which scales with squared torque $\tau^2$ (since torque $\tau \propto I$ and dissipated power $P \propto I^2$, where $I$ is the motor current). The temperature dynamics are modeled as a first-order system driven by squared torque:
$$\dot{T} = -\alpha (T - T_{\mathrm{ambient}}) + \beta \tau^2$$
where:

  • $\dot{T}$ is the rate of change of temperature.
  • $\alpha$ is the thermal cooling coefficient, representing how quickly heat is dissipated to the environment.
  • $T$ is the current actuator temperature.
  • $T_{\mathrm{ambient}}$ is the ambient temperature.
  • $\beta$ is the heating coefficient, representing how much heat is generated per unit of squared torque.
  • $\tau$ is the actuator torque. The parameters $\alpha$, $\beta$, and $T_{\mathrm{ambient}}$ are fitted from experimental data.
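
A minimal sketch of rolling out this thermal model together with the CBF penalty follows, using the fitted parameters reported later in Table II; the simulation time step is an assumption.

```python
import numpy as np

ALPHA, BETA, T_AMB = 0.038, 0.377, 43.94   # fitted parameters (Table II)
T_MAX, GAMMA_T = 80.0, 0.312               # reward parameters (Table II)
DT = 0.02                                  # assumed simulation step (s)

def step_temperature(T, tau):
    """First-order thermal model (Eq. 6): cooling toward ambient plus Joule
    heating proportional to squared torque. Returns (next T, T_dot)."""
    T_dot = -ALPHA * (T - T_AMB) + BETA * tau ** 2
    return T + DT * T_dot, T_dot

def thermal_cbf_penalty(T, T_dot):
    """Penalty on violations of -T_dot + gamma_T * (T_max - T) >= 0;
    zero when the condition holds, negative otherwise."""
    T = np.asarray(T, dtype=float)
    T_dot = np.asarray(T_dot, dtype=float)
    violation = np.minimum(-T_dot + GAMMA_T * (T_MAX - T), 0.0)
    return -float(np.sum(np.abs(violation)))

# A hot actuator under sustained torque starts accruing penalty.
T, tau = 78.0, 3.0
T, T_dot = step_temperature(T, tau)
print(round(T, 2), thermal_cbf_penalty([T], [T_dot]))
```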

Joint Limits

To prevent joint-limit violations, a similar reward function based on CBF conditions is used. It enforces a margin $q_{\mathrm{m}}$ from each joint's physical limits $q_{\mathrm{min}}$ and $q_{\mathrm{max}}$. For each joint, two barrier functions $h_q(q)$ are defined:
$$h_q(q) = \begin{cases} q - (q_{\mathrm{min}} + q_{\mathrm{m}}) \geq 0, & \text{lower limit}, \\ (q_{\mathrm{max}} - q_{\mathrm{m}}) - q \geq 0, & \text{upper limit}, \end{cases}$$
where:

  • $q$ is the current joint position.

  • $q_{\mathrm{min}}$ and $q_{\mathrm{max}}$ are the physical lower and upper joint limits.

  • $q_{\mathrm{m}}$ is a safety margin (set to 0.1 rad).

    The corresponding per-joint CBF constraints are:
$$\begin{aligned} \dot{q} + \gamma_q \left( q - (q_{\mathrm{min}} + q_{\mathrm{m}}) \right) &\geq 0, \quad \text{lower limit}, \\ -\dot{q} + \gamma_q \left( (q_{\mathrm{max}} - q_{\mathrm{m}}) - q \right) &\geq 0, \quad \text{upper limit}, \end{aligned}$$
where:

  • $\dot{q}$ is the joint velocity.

  • $\gamma_q > 0$ is a positive constant (set to 20). These constraints ensure that joint positions stay within the safe operating range by limiting their velocities as they approach the limits. The penalties, as defined in Table I, are then derived from the violation of these CBF conditions for both lower and upper limits.
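
Analogously, the per-joint limit conditions can be evaluated as in this short sketch; negative entries mark violations that would be penalized.

```python
import numpy as np

Q_MARGIN, GAMMA_Q = 0.1, 20.0  # margin (rad) and gamma_q from the text

def joint_limit_cbf_violations(q, q_dot, q_min, q_max):
    """Evaluate the lower- and upper-limit CBF conditions per joint."""
    q, q_dot = np.asarray(q, dtype=float), np.asarray(q_dot, dtype=float)
    lower = q_dot + GAMMA_Q * (q - (np.asarray(q_min) + Q_MARGIN))
    upper = -q_dot + GAMMA_Q * ((np.asarray(q_max) - Q_MARGIN) - q)
    return np.minimum(lower, 0.0), np.minimum(upper, 0.0)
```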

4.2.4. Show Functions

Olaf's show functions (eyes, eyebrows, jaw, arms) have low inertia and minimally affect system dynamics. Therefore, they are controlled using classical methods rather than RL.

The control process involves mapping functional space (how motions are animated) to actuator space (how actuators move). This mapping is derived using a forward-kinematics solver [38] by uniformly sampling the region of interest and fitting a polynomial.

  • Eyes: The functional space includes left and right eye yaw, coupled eye pitch, and eyelid closure. A first-order polynomial per actuator provides sufficient accuracy.
  • Arms: Implemented as spherical 5-bar linkages, each driven by two actuators. Their functional coordinates are parameterized by two serial revolute angles: arm swing followed by arm pitch. Arm swing maps directly to the first actuator, while arm pitch is coupled through both actuators. The second actuator's position is obtained via a cubic polynomial fit. After mapping, all eye and arm actuators are controlled using a PD loop.
  • Jaw: The costume's fabric tension and wrinkling introduce significant external forces. To compensate, a feedforward term $\tau_{\mathrm{ff}}^{\mathrm{jaw}}(q^{\mathrm{jaw}})$ is added. This term is determined by measuring the torque required to hold a set of uniformly sampled jaw angles $q^{\mathrm{jaw}}$ across the full range of motion. A first-order polynomial with an additional cosine term is fitted to the data using least squares to capture the observed non-linearity:
$$\tau_{\mathrm{ff}}^{\mathrm{jaw}}(q^{\mathrm{jaw}}) = c_0 + c_1 q^{\mathrm{jaw}} + c_{\mathrm{cos}} \cos(q^{\mathrm{jaw}})$$
where:
  • $\tau_{\mathrm{ff}}^{\mathrm{jaw}}$ is the feedforward torque for the jaw.
  • $q^{\mathrm{jaw}}$ is the jaw angle.
  • $c_0$, $c_1$, and $c_{\mathrm{cos}}$ are the fitted model parameters.
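
The least-squares fit for this feedforward term can be sketched as follows; the sampled angles and torques in the example are synthetic, not measurements from the paper.

```python
import numpy as np

def fit_jaw_feedforward(q_samples, tau_samples):
    """Fit tau_ff(q) = c0 + c1*q + c_cos*cos(q) by linear least squares."""
    q = np.asarray(q_samples, dtype=float)
    A = np.stack([np.ones_like(q), q, np.cos(q)], axis=1)
    c0, c1, c_cos = np.linalg.lstsq(A, np.asarray(tau_samples, dtype=float),
                                    rcond=None)[0]
    return lambda q_jaw: c0 + c1 * q_jaw + c_cos * np.cos(q_jaw)

# Synthetic holding-torque measurements over the jaw range (illustrative).
q = np.linspace(-0.3, 0.6, 12)
tau = 0.4 + 0.9 * q + 0.2 * np.cos(q) + 0.01 * np.random.randn(q.size)
tau_ff = fit_jaw_feedforward(q, tau)
print(tau_ff(0.2))
```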

5. Experimental Setup

5.1. Datasets

The paper does not use traditional public datasets in the way typical machine learning papers do. Instead, the "data" for training and validation comes from two main sources:

  • Animation References: These are artist-created kinematic motions for walking and standing, designed to capture Olaf's characteristic, stylized movements, including a heel-toe gait. These references serve as the target behavior for the Reinforcement Learning policies. The control input $\boldsymbol{g}_t$ to the policy is randomized across its full range during training, meaning the policies learn to generalize across various commanded movements within the animated style.

  • Recorded Actuator Data: For the thermal model, data was collected from the physical robot. Specifically, 20 minutes of recorded data were used to fit the parameters ($\alpha$, $\beta$, $T_{\mathrm{ambient}}$) of the thermal dynamics model (Equation 6) using least-squares regression. A separate 10-minute trajectory, not part of the training data, was used for validating the thermal model's predictive accuracy.

    These data sources are effective for validating the method's performance because:

  • The animation references directly test the core objective of character fidelity and stylized motion imitation.

  • The recorded actuator data directly addresses a critical hardware constraint (overheating) and allows for empirical validation of the thermal model and thermal-aware control.
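
One plausible way to fit the thermal parameters from such recorded data is a linear least-squares regression on finite-difference temperature derivatives, sketched below under those assumptions; the paper states only that least squares over 20 minutes of data was used.

```python
import numpy as np

def fit_thermal_params(T_log, tau_log, dt):
    """Fit alpha, beta, T_ambient of T_dot = -alpha*(T - T_amb) + beta*tau^2.

    Rewrites the model as T_dot = a*T + b*tau^2 + c with alpha = -a,
    beta = b, T_ambient = c / alpha, then solves by least squares."""
    T = np.asarray(T_log, dtype=float)
    T_dot = np.gradient(T, dt)  # finite-difference temperature derivative
    A = np.stack([T, np.asarray(tau_log, dtype=float) ** 2, np.ones_like(T)],
                 axis=1)
    a, b, c = np.linalg.lstsq(A, T_dot, rcond=None)[0]
    alpha, beta = -a, b
    return alpha, beta, c / alpha
```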

5.2. Evaluation Metrics

The paper uses several metrics to evaluate the performance of Olaf's control and mechanical design:

  • Mean Absolute Joint-Tracking Error:

    1. Conceptual Definition: This metric quantifies how closely the robot's actual joint positions follow the desired joint positions specified by the kinematic reference or animation reference. A lower error indicates better imitation of the target motion. It's a direct measure of how well the robot performs its intended movements.
    2. Mathematical Formula: Let $q_{\mathrm{actual},j,t}$ be the actual position of joint $j$ at time $t$, and $q_{\mathrm{target},j,t}$ be the target position of joint $j$ at time $t$. For a trajectory of $N$ time steps and $M$ joints, the mean absolute joint-tracking error (MAE) is: $\mathrm{MAE} = \frac{1}{N \cdot M} \sum_{t=1}^{N} \sum_{j=1}^{M} |q_{\mathrm{actual},j,t} - q_{\mathrm{target},j,t}|$
    3. Symbol Explanation:
      • $\mathrm{MAE}$: Mean Absolute Error.
      • $N$: Total number of time steps in the trajectory.
      • $M$: Total number of joints being tracked.
      • $q_{\mathrm{actual},j,t}$: The measured or simulated actual position of joint $j$ at time $t$.
      • $q_{\mathrm{target},j,t}$: The target position of joint $j$ at time $t$, derived from the animation reference.
      • $|\cdot|$: Absolute value.
      • $\sum$: Summation operator.
  • Actuator Temperature ($T$, in °C):

    1. Conceptual Definition: This metric directly measures the temperature of the actuators. It is crucial for assessing the effectiveness of the thermal-aware policy in preventing overheating and ensuring the longevity and safe operation of the robot, especially given the design constraints.
    2. Mathematical Formula: Not a calculated metric in the evaluation sense, but a direct measurement from hardware sensors or a simulated value from the thermal model.
    3. Symbol Explanation:
      • $T$: Actuator temperature in degrees Celsius (°C).
  • Mean Absolute Error of Thermal Model ($1.87^\circ \mathrm{C}$):

    1. Conceptual Definition: This measures the average absolute difference between the temperatures predicted by the thermal model (Equation 6) and the actual measured actuator temperatures over a period. It quantifies the accuracy of the model in predicting temperature dynamics.
    2. Mathematical Formula: Similar to the joint-tracking error, but applied to temperature over time. For $N$ time steps: $\mathrm{MAE}_{\mathrm{thermal}} = \frac{1}{N} \sum_{t=1}^{N} |T_{\mathrm{predicted},t} - T_{\mathrm{measured},t}|$
    3. Symbol Explanation:
      • $\mathrm{MAE}_{\mathrm{thermal}}$: Mean Absolute Error for the thermal model.
      • $N$: Total number of time steps.
      • $T_{\mathrm{predicted},t}$: Temperature predicted by the thermal model at time $t$.
      • $T_{\mathrm{measured},t}$: Actual measured actuator temperature at time $t$.
  • Mean Sound Level Reduction (in dB):

    1. Conceptual Definition: This metric quantifies the decrease in the average loudness of footstep sounds when the impact reduction reward is applied, compared to a baseline without it. Decibels (dB) is a logarithmic unit used to express the ratio of two values of a physical quantity, often power or intensity. In acoustics, it's used to measure sound pressure level. A reduction in dB indicates quieter operation, which contributes to the robot's believability.
    2. Mathematical Formula: Sound level in dB is typically calculated as $L_p = 10 \log_{10}\left( \frac{p^2}{p_{\mathrm{ref}}^2} \right)$, where $p$ is the RMS sound pressure and $p_{\mathrm{ref}}$ is a reference sound pressure. The reduction is the difference between two dB values, e.g., $L_{p,\mathrm{baseline}} - L_{p,\mathrm{improved}}$.
    3. Symbol Explanation:
      • $\mathrm{dB}$: Decibel, a unit for sound level.
      • $L_p$: Sound pressure level.
      • $p$: Root Mean Square (RMS) sound pressure.
      • $p_{\mathrm{ref}}$: Reference sound pressure, typically $20\,\mu\mathrm{Pa}$ (the threshold of human hearing). The reported reduction (13.5 dB) directly indicates how much quieter the robot became.
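
For reference, sound levels and their differences can be computed as in the following sketch; the pressure values are arbitrary and only illustrate that a 13.5 dB reduction corresponds to roughly a 4.7-fold drop in RMS sound pressure.

```python
import numpy as np

def sound_level_db(p_rms, p_ref=20e-6):
    """Sound pressure level in dB from an RMS pressure in pascals."""
    return 10.0 * np.log10(p_rms ** 2 / p_ref ** 2)

baseline, quieter = 0.1, 0.1 / 10 ** (13.5 / 20)   # ~4.73x lower pressure
print(sound_level_db(baseline) - sound_level_db(quieter))  # ≈ 13.5 dB
```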

5.3. Baselines

The paper primarily evaluates its novel contributions by comparing policies trained with the proposed features against policies trained without them. These serve as the implicit baselines:

  • Baseline for Thermal Control: A policy trained without the thermal reward. This baseline demonstrates the problem of rapid actuator overheating when the thermal constraints are not explicitly considered in the RL reward function or observation space.

  • Baseline for Impact Reduction: A policy trained without the foot impact reduction reward. This baseline highlights the default harsh contact sounds and higher peak foot velocities that occur when sound suppression is not incentivized, demonstrating the effectiveness of the proposed reward in making movements quieter.

  • Baseline for Stylized Gait: A policy trained without Olaf's characteristic heel-toe walk. This baseline is used to show the impact of the stylized gait on the visual appearance of the robot's locomotion, demonstrating that omitting specific animation nuances makes the motion appear more "robotic."

    These baselines are representative because they isolate the effects of the specific innovations proposed in the paper, allowing for a clear assessment of their impact on robot performance and believability.

6. Results & Analysis

6.1. Core Results Analysis

Olaf's performance is demonstrated through both qualitative (visual results in the supplementary video) and quantitative evaluations. The asymmetrical leg design is successful in faithfully imitating Olaf's characteristic animations while fitting within the design constraints. The magnetic arms and nose contribute to character gags, enhancing believability.

6.1.1. Tracking Performance

The ability of the robot to accurately follow the kinematic references from animation is crucial for character fidelity.

  • For the standing policy, the mean absolute joint-tracking error was $3.87^\circ \pm 2.40^\circ$.
  • For the walking policy, the mean absolute joint-tracking error was $4.02^\circ \pm 2.01^\circ$. These low error values, averaged over 5-minute runs and across all joints, indicate that the RL policies are highly effective at imitating the artist-designed motions, which is fundamental for bringing the character to life.

The paper also highlights the importance of Olaf's characteristic heel-toe walk. When a policy was trained without this specific stylized gait, the resulting motion appeared "more robotic," visually demonstrating that fine details in animation references are critical for maintaining the illusion of a lifelike character.

6.1.2. Thermal Modeling

The accurate prediction of actuator temperatures is vital for implementing the thermal-aware policy. The thermal model (Equation 6) was fitted using parameters listed in Table II.

The following are the parameters from Table II of the original paper:

| Thermal Model | Value | Reward Function | Value |
| --- | --- | --- | --- |
| $\alpha$ | 0.038 | $T_{\max}$ | 80°C |
| $\beta$ | 0.377 | $T_{\mathrm{clip}}$ | [70°C, 85°C] |
| $T_{\mathrm{ambient}}$ | 43.94°C | $\gamma_T$ | 0.312 |

The model's predictive accuracy was validated over a 10-minute trajectory not used during fitting. The model achieved a mean absolute error of $1.87^\circ \mathrm{C}$.

The following figure (Figure 5 from the original paper) shows the validation of the thermal model:

Fig. 5: Thermal Model. Validation of the thermal model in (6), comparing predicted actuator temperatures $T$ with measured values over a 10 min rollout. Temperatures are quantized as reported by the actuator.

Figure 5 illustrates that the simulated temperature (predicted by the model) closely tracks the measured temperature from the actuator over time, demonstrating good predictive accuracy and thus confidence in using this model within the RL simulation for thermal-aware control.

Evaluation of Thermal Reward: The effectiveness of the thermal reward was specifically evaluated on the neck-pitch actuator, identified as the most prone to overheating. The comparison was between policies trained with and without the thermal reward.

The following figure (Figure 6 from the original paper) presents the results of the thermal reward evaluation:

Fig. 6: Thermal Reward Evaluation. Neck-pitch temperature, mean joint-tracking error, and squared neck-pitch torque for policies trained with and without the thermal reward. The thermal reward slows temperature rise while maintaining tracking at low temperatures and slightly relaxes tracking near the temperature limit to prevent overheating.

Figure 6 shows a clear difference:

  • Without Thermal Reward (Baseline): The actuator temperature rises rapidly, reaching 100°C within 40 seconds, necessitating experiment termination to prevent damage. The squared torque (a proxy for heat generation) remains high.
  • With Thermal Reward: The actuator temperature rises significantly slower. Although the joint-tracking error is slightly larger, the policy primarily works by reducing torque usage well before reaching the temperature limit. As the temperature approaches the 80°C threshold, the policy adjusts the head towards a more horizontal orientation (requiring more torque initially but less over time), effectively managing heat generation. Tracking accuracy remains nearly the same at low temperatures and relaxes only slightly near the limit, demonstrating a successful trade-off between performance and thermal safety; a hedged sketch of a barrier-style penalty of this kind follows this list.
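The thermal reward builds on a Control Barrier Function (CBF) formulation (see Sections 6.2 and 7.1). Below is a minimal sketch of one plausible barrier-style penalty using the Table II constants; the actual reward shaping in the paper may differ, so treat the functional form as an assumption.

```python
import numpy as np

T_MAX = 80.0           # temperature limit, degC (Table II)
T_CLIP = (70.0, 85.0)  # observed temperatures clipped to this range (Table II)
GAMMA_T = 0.312        # barrier decay rate (Table II)

def thermal_penalty(T: float, dT_dt: float) -> float:
    """Barrier-style penalty: with h(T) = T_MAX - T, require the CBF-like
    condition dT/dt <= GAMMA_T * h(T); any violation is penalized, which
    discourages torque-heavy behavior as the actuator heats up."""
    T = float(np.clip(T, T_CLIP[0], T_CLIP[1]))
    h = T_MAX - T                      # remaining margin to the limit
    violation = dT_dt - GAMMA_T * h    # > 0 means the condition is violated
    return max(0.0, violation)

# In the total reward this penalty would enter with the "Neck temperature"
# weight of 2.0 from Table III, as a negative contribution.
```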

6.1.3. Foot Impact

The foot impact reduction reward was evaluated for its effect on sound suppression.

  • Over a 5-minute run, this reward reduced the mean sound level by 13.5 dB. This is a substantial reduction, making the robot noticeably quieter and enhancing believability.

    The following figure (Figure 7 from the original paper) compares foot velocity and position profiles:

    Fig. 7: Foot Impact Reduction. Comparison of z-foot velocity and position for the reference and policies trained with and without the foot impact reduction reward.

Figure 7 illustrates the impact reduction mechanism:

  • Reference: Shows the desired vertical foot velocity and position profiles.
  • Policy without Impact Reduction: Follows the reference motion slightly better but exhibits higher peak velocities at foot impact, leading to harsher contacts and louder sounds.
  • Policy with Impact Reduction: The overall trajectory is preserved, but small nuances like the mid-swing dip of the foot are smoothed out. Crucially, it avoids high peak velocities at foot impact, which directly reduces impact forces and thus noise. This demonstrates that the reward acts as a regularizer, achieving quieter locomotion without significantly distorting the overall motion profile or tracking performance; a hedged sketch of one possible form of this term follows this list.
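One plausible way to encode such a regularizer is to penalize downward foot velocity whenever the foot is close to the ground, which directly discourages fast touch-downs. The sketch below is illustrative only: the height threshold and the exact functional form are assumptions, not taken from the paper.

```python
NEAR_CONTACT_HEIGHT = 0.03  # metres; hypothetical "near the ground" threshold

def impact_penalty(foot_height: float, foot_vel_z: float) -> float:
    """Squared downward foot velocity near the ground, zero otherwise."""
    if foot_height < NEAR_CONTACT_HEIGHT and foot_vel_z < 0.0:
        return foot_vel_z ** 2
    return 0.0

# With the small Table III weight of 2.5e-3, a term like this acts as a
# gentle regularizer on touch-down speed rather than reshaping the gait.
```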

6.1.4. Reward Weights

The specific reward weights used for training the standing and walking policies are presented in Table III. These weights are crucial for balancing the different objectives (imitation, regularization, limits, impact reduction) and shaping the robot's behavior.

The following are the reward weights from Table III of the original paper. Two values indicate Standing / Walking; a single value applies to both.

Reward Name            Standing / Walking     Reward Name             Standing / Walking
Torso position xy      1.0 / 4.0              Neck action rate        5.0 / 10.0
Torso orientation      2.0 / 1.5              Leg action rate         2.0 / 5.0
Linear vel. xy         1.5 / 2.5              Leg action acc.         0.5 / 1.0
Linear vel. z          1.0                    Neck action acc.        15.0 / 10.0
Angular vel. z         1.5                    Neck temperature        2.0
Leg joint pos.         15.0                   Joint limits            0.5 / 0.2
Neck joint pos.        40.0                   Foot-Foot collision     10.0
Leg joint vel.         1.0×10⁻³               Impact reduction        2.5×10⁻³
Neck joint vel.        0.5                    Joint torques           1.0×10⁻³
Contact                2.0 / 1.0              Joint acc.              2.5×10⁻⁶
Survival               20.0

Analysis of Reward Weights:

  • High Weights for Imitation & Survival: Neck joint pos. (40.0), Leg joint pos. (15.0), and Survival (20.0) have relatively high weights, emphasizing the importance of staying upright and mimicking key joint positions.
  • Variable Weights for Torso/Linear Velocities: Weights for Torso position xy, Torso orientation, and Linear vel. xy differ between standing and walking, indicating different priorities for stability versus dynamic movement. For instance, Torso position xy is more heavily weighted during walking (4.0) than standing (1.0), likely to ensure the robot moves along its intended path.
  • Regularization Weights: Joint torques (1.0×10⁻³) and Joint acc. (2.5×10⁻⁶) have very low weights, suggesting a slight penalty to encourage smoothness without overly restricting dynamic behavior.
  • Action Rate and Acceleration Weights: Neck action rate and acc. have higher weights (5.0/10.0 and 15.0/10.0 respectively) than leg actions, potentially to ensure smoother and more controlled head movements, which are crucial for expressiveness.
  • Specialized Rewards: Neck temperature (2.0), Joint limits (0.5/0.2), Foot-Foot collision (10.0), and Impact reduction (2.5×10⁻³) have moderate to low weights, but their inclusion is critical for safety and believability, even if their direct numerical contribution to the total reward is smaller than that of the high-weighted imitation terms. The impact reduction weight is particularly low, likely to prevent it from excessively altering the desired gait while still providing a noticeable effect.
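To make the weighting concrete, the following is a minimal sketch of how per-term rewards are typically combined into a single scalar via a weighted sum, using a subset of the Table III walking weights; the term names and the r_terms dictionary are placeholders, not the authors' implementation.

```python
# Subset of Table III weights for the walking policy (illustrative).
WALKING_WEIGHTS = {
    "torso_position_xy": 4.0,
    "leg_joint_pos": 15.0,
    "neck_joint_pos": 40.0,
    "survival": 20.0,
    "neck_temperature": 2.0,     # thermal penalty term
    "impact_reduction": 2.5e-3,  # foot impact penalty term
    # ... remaining terms from Table III
}

def total_reward(r_terms: dict, weights: dict = WALKING_WEIGHTS) -> float:
    """Weighted sum of individual reward/penalty terms; r_terms must
    provide a value for every key in weights."""
    return sum(w * r_terms[name] for name, w in weights.items())
```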

6.2. Ablation Studies / Parameter Analysis

The paper implicitly performs ablation studies by comparing policies trained with and without certain reward components:

  • Thermal Reward Ablation: Comparing policies with and without the Neck temperature reward (Figure 6) clearly shows its effectiveness in preventing overheating. This validates the thermal-aware policy and the CBF-based reward formulation.

  • Impact Reduction Reward Ablation: Comparing policies with and without the Impact reduction reward (Figure 7 and dB reduction data) demonstrates its efficacy in reducing footstep noise. This validates the design of this specific reward term.

  • Gait Stylization Ablation: The qualitative comparison of the heel-toe walk versus a standard gait (discussed in supplementary video) shows the importance of detailed animation references for character believability.

    These comparisons serve as strong evidence for the necessity and effectiveness of the proposed innovations, demonstrating that each component contributes positively to the overall goal of bringing Olaf to life believably.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work successfully presented Olaf, a robotic embodiment of an animated character, capable of freely walking and accurately imitating the character's unique style and appearance. The authors addressed significant design challenges, including unfavorable proportions and non-physical movements, through a combination of innovative mechatronic design and Reinforcement Learning (RL)-based control. Key achievements include the development of an asymmetric 6-DoF leg mechanism hidden beneath a foam skirt, the integration of impact-reducing rewards to significantly lower footstep noise, and the implementation of thermal-aware policies using Control Barrier Function (CBF) constraints to prevent actuator overheating and joint-limit violations. The validation on both simulation and hardware confirms that Olaf sets a new standard for believability in costumed robotic characters.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Thermal Model Fidelity: The current thermal model (Equation 6) is a first-order system that primarily considers Joule heating. Future work could incorporate a higher fidelity thermal model to account for mechanical effects like friction (which also generates heat) or the gradual heating of actuator enclosures during extended operation. This would lead to more accurate thermal predictions and potentially more robust thermal management.
  • Costume-Leg Interaction Modeling: The interaction forces between the costume and the legs were primarily handled through domain-randomized disturbance forces during RL training. Explicitly modeling these complex interactions could reduce the reliance on randomization and provide more targeted and efficient training, potentially leading to smoother and more consistent leg movements.

7.3. Personal Insights & Critique

This paper represents a fascinating intersection of robotics, animation, and control theory, pushing the boundaries of what character robotics can achieve. The core insight that believability is a critical performance metric, alongside traditional functionality, is highly relevant for the future of human-robot interaction.

  • Transferability: The proposed solutions, particularly the thermal-aware policy and impact reduction reward, are highly generalizable. The CBF-based thermal constraint, for instance, could be applied to any robot system where actuator overheating is a concern due to design constraints or demanding tasks. Similarly, impact reduction is valuable for any robot operating in environments where noise is undesirable or where delicate interaction is required. The asymmetric leg design principle could inspire compact mechanisms for other custom robotic forms.
  • Innovation in Problem Formulation: The strength of this work lies not just in its technical solutions, but in how it frames the problem itself. By rigorously incorporating aesthetic demands and character-specific nuances (like the heel-toe walk) into the RL reward function and mechanical design, it demonstrates a powerful methodology for creating emotive and engaging robots.
  • Potential Issues/Areas for Improvement:
    • Costume Wear and Tear: While the magnetic attachments for appendages mitigate damage during falls, the 4-way stretch fabric costume, especially around moving parts like the jaw and legs, might be prone to wear and tear over extended operation. The paper mentions fabric tension and wrinkling around the jaw, which are addressed with a feedforward term, but maintenance and durability of the costume itself could be a practical challenge for long-term deployment.

    • Real-world vs. Simulation Discrepancy: The paper mentions domain randomization for costume-leg interactions. While effective for sim-to-real transfer, explicitly modeling these interactions could reduce the sim-to-real gap and potentially improve robustness and efficiency of training. This ties into the authors' own future work suggestion.

    • Actuator Power and Battery Life: Given the heavy head and small actuators in the neck, and the emphasis on preventing overheating, it would be interesting to see how the thermal-aware policy impacts overall power consumption and battery life during extended operation. While reducing torque can save energy, active cooling or larger actuators might be necessary for continuous high-performance tasks, which might conflict with the compact design.

    • Expressiveness beyond Core Motions: While show functions are handled by classical control, integrating some aspects of facial expressions or arm gestures more deeply into the RL policy (perhaps through a hierarchical RL approach) could potentially lead to even more dynamic and context-aware expressiveness, especially in response to environment stimuli or human interaction.

      Overall, this paper provides a robust blueprint for future character robotics, highlighting the importance of interdisciplinary collaboration between mechanical design, control engineering, and animation to create truly believable robotic companions.
