MARR: A Multi-Agent Reinforcement Resetter for Redirected Walking
TL;DR Summary
MARR is a novel approach leveraging multi-agent reinforcement learning to optimize the reset mechanism in redirected walking for VR. It significantly reduces reset occurrences by finding optimal reset directions in dynamic multi-user environments, enhancing immersion beyond existing heuristics.
Abstract
The reset technique of Redirected Walking (RDW) forcibly reorients the user’s direction overtly to avoid collisions with boundaries, obstacles, or other users in the physical space. However, excessive resetting can decrease the user’s sense of immersion and presence. Several RDW studies have been conducted to address this issue. Among them, much research has been done on reset techniques that reduce the number of resets by devising reset direction rules or optimizing them for a given environment. However, existing optimization studies on reset techniques have mainly focused on a single-user environment. In a multi-user environment, the dynamic movement of other users and static obstacles in the physical space increase the possibility of resetting. In this study, we propose Multi-Agent Reinforcement Resetter (MARR), which resets the user taking into account both physical obstacles and multi-user movement to minimize the number of resets. MARR is trained using multi-agent reinforcement learning to determine the optimal reset direction in different environments. This approach allows MARR to effectively account for different environmental contexts, including arbitrary physical obstacles and the dynamic movements of other users in the same physical space. We compared MARR to other reset technologies through simulation tests and user studies, and found that MARR outperformed the existing methods. MARR improved performance by learning the optimal reset direction for each subtle technique used in training. MARR has the potential to be applied to new subtle techniques proposed in the future. Overall, our study confirmed that MARR is an effective reset technique in multi-user environments.
In-depth Reading
1. Bibliographic Information
1.1. Title
MARR: A Multi-Agent Reinforcement Resetter for Redirected Walking
1.2. Authors
- Ho Jung Lee (Department of Computer Science, Yonsei University, Seoul, South Korea)
- Sang-Bin Jeon (Department of Computer Science, Yonsei University, Seoul, South Korea)
- Yong-Hun Cho (Digital eXPerience Lab, Korea University, Seoul, South Korea)
- In-Kwon Lee (Corresponding Author, Department of Computer Science, Yonsei University, Seoul, South Korea)
1.3. Journal/Conference
IEEE Transactions on Visualization and Computer Graphics (TVCG)
- Status: Published.
- Reputation: IEEE TVCG is one of the most prestigious and influential journals in the fields of computer graphics, visualization, and virtual reality. Being published here indicates high technical rigor and significant contribution to the field.
1.4. Publication Year
2024 (Published online February 21, 2024; Current version date January 31, 2025).
1.5. Abstract
This paper addresses a critical challenge in Virtual Reality (VR) locomotion known as Redirected Walking (RDW). Specifically, it focuses on the "Reset" technique—a safety mechanism triggered when a user is about to collide with a physical boundary (like a wall) or another user. While necessary, resets break immersion. Existing methods to minimize resets often fail in complex, multi-user environments with dynamic obstacles. The authors propose MARR (Multi-Agent Reinforcement Resetter), a system trained using Multi-Agent Reinforcement Learning (MARL). MARR allows multiple users (agents) to learn optimal reset directions by considering the static environment and the dynamic movements of other users. Experiments show MARR significantly reduces the number of resets compared to state-of-the-art heuristic methods in both simulations and real-user studies.
1.6. Original Source Link
Official IEEE Xplore Link (DOI: 10.1109/TVCG.2024.3368043) Status: Officially Published.
2. Executive Summary
2.1. Background & Motivation
The Core Problem: In Virtual Reality, users often want to explore vast virtual worlds while physically confined to a small room. Redirected Walking (RDW) solves this by subtly rotating the virtual world as the user walks, tricking them into walking in circles physically while they think they are walking straight. However, RDW isn't perfect. When a user gets too close to a wall or another user, the system must trigger a "Reset". This stops the user and forces them to turn around (e.g., "Please turn 180 degrees").
Why It Matters: Frequent resets are annoying and destroy the sense of presence (immersion). Ideally, a reset should reorient the user in a direction that allows them to walk for a long time before needing another reset.
The Gap: Most existing reset strategies are heuristics (simple rules of thumb), such as "always turn towards the center of the room" (Center-Reset) or "turn away from the nearest wall" (Gradient-Reset). These rules fail in Multi-User Environments. For example, turning toward the center might cause a collision if another user is standing there. Existing optimization methods mainly focus on single-user scenarios and cannot adapt to the chaotic, dynamic movement of multiple people sharing the same physical space.
2.2. Main Contributions & Findings
- MARR (Multi-Agent Reinforcement Resetter): The authors propose the first reset controller powered by Multi-Agent Reinforcement Learning (MARL). Unlike static rules, MARR learns a policy where agents (users) cooperate to find reset directions that maximize future walking space and minimize collisions with each other.
- Robust Performance: Through extensive simulations and user studies, MARR consistently outperformed existing methods (MR2C, R2G, SFR2G), significantly reducing the total number of resets in various room layouts and user counts.
- Adaptability: The system proved effective across different "subtle" steering algorithms (like Steer-to-Center or Artificial Potential Fields), showing it can learn the optimal reset strategy regardless of the underlying steering logic.
The following figure (Figure 1 from the original paper) illustrates the difference between traditional heuristic resets and the proposed MARR. While methods like (a) MR2C or (b) R2G might direct a user into an obstacle or another user, (d) MARR intelligently finds an open path (the green arrow).
Figure 1 (translated caption): A schematic showing how MARR (Multi-Agent Reinforcement Resetter) reduces the number of resets by choosing different reset directions in a multi-user environment. Panels (a) to (d) show the reset process in the presence of physical obstacles and dynamically moving users.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, you need to grasp three key areas:
- Redirected Walking (RDW):
  - Subtle Techniques: These continuously apply small, imperceptible rotations to the virtual camera. For example, as you walk straight, the system rotates the world slightly left, causing you to unconsciously turn right to compensate. This keeps you away from walls.
  - Overt Techniques (Resets): When subtle techniques fail and a collision is imminent, the system halts the simulation. A visual prompt (e.g., an arrow) appears, commanding the user to "Reset", usually via a 2:1 turn (turning 360° virtually while physically turning 180°). This paper focuses on optimizing which direction the arrow points. (A minimal sketch of this 2:1 mapping appears after this list.)
- Artificial Potential Fields (APF):
  - A method used in robotics and RDW where obstacles (walls, other users) exert a "repulsive force" and open spaces exert an "attractive force." Users are steered along the path of least resistance (the gradient).
- Multi-Agent Reinforcement Learning (MARL):
  - Reinforcement Learning (RL): An AI training method where an "agent" learns to make decisions by performing actions and receiving "rewards" (points) or "penalties."
  - Multi-Agent: Multiple agents learn simultaneously in the same environment.
  - CTDE (Centralized Training, Decentralized Execution): A specific MARL framework. During training, a central "brain" (the Critic) sees everyone's position to help the agents learn. During execution, each agent (Actor) acts alone based only on what it observes.
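As referenced above, here is a minimal sketch of the rotation-gain mapping behind a 2:1 reset turn. It is a generic illustration with a hypothetical helper name, not the authors' implementation.

```python
import math

def reset_virtual_yaw_delta(physical_yaw_delta_deg: float, gain: float = 2.0) -> float:
    """During a 2:1 reset turn, each degree of physical rotation is amplified by
    `gain` in the virtual scene, so a 180-degree physical turn produces a
    360-degree virtual turn: the user ends up facing their original virtual
    heading while physically facing away from the obstacle."""
    return physical_yaw_delta_deg * gain

# Example: the user physically turns 180 degrees in 18 increments of 10 degrees.
virtual_total = sum(reset_virtual_yaw_delta(step) for step in [10.0] * 18)
assert math.isclose(virtual_total, 360.0)
```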
3.2. Previous Works
- Heuristic Reset Methods:
  - Reset-to-Center (R2C): Always points the user to the room's center. Simple, but fails if the center is occupied by an obstacle.
  - Modified Reset-to-Center (MR2C) & Reset-to-Gradient (R2G): Proposed by Thomas et al., these account for obstacles. R2G uses the repulsive forces from APF to determine the "safest" direction away from walls. (A simplified sketch of this gradient idea appears after this list.)
  - Step-Forward R2G (SFR2G): An iterative version of R2G that simulates taking small steps to find a better local minimum.
- Multi-User RDW:
  - Bachmann et al. and Messinger et al. extended APF to multi-user scenarios, treating other users as moving obstacles.
  - However, previous reset strategies in multi-user settings were mostly reactive (avoiding the immediate collision) rather than strategic (planning for long-term open space).
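As referenced above, here is a simplified sketch of the gradient idea behind R2G: obstacles are sampled as points that exert repulsive forces, and the reset direction follows the summed force. The force model, sampling, and names here are illustrative and differ from the exact formulation of Thomas et al.

```python
import numpy as np

def apf_reset_direction(user_pos: np.ndarray, obstacle_points: list) -> float:
    """Simplified Reset-to-Gradient idea: each sampled obstacle point pushes the
    user away with a repulsive force that grows as distance shrinks; the reset
    direction is the heading of the summed force. Returns an angle in radians."""
    force = np.zeros(2)
    for p in obstacle_points:
        away = user_pos - p                  # vector pointing away from the obstacle sample
        dist = np.linalg.norm(away)
        if dist > 1e-6:
            force += away / dist ** 2        # closer obstacles push harder (1/d magnitude)
    return float(np.arctan2(force[1], force[0]))

# Example: a user near the east wall of a 10 m x 10 m room is pushed roughly west.
wall_samples = [np.array([10.0, y]) for y in np.linspace(0.0, 10.0, 21)]
heading = apf_reset_direction(np.array([9.0, 5.0]), wall_samples)  # close to pi (west)
```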
3.3. Technological Evolution & Differentiation
- Evolution: RDW research progressed from single-user heuristics → multi-user heuristics (APF) → single-agent RL (predictive steering).
- This Paper's Position: This is the first application of Multi-Agent RL specifically to the reset phase.
- Differentiation: Unlike R2G or MR2C, which calculate directions based on the current geometry (forces), MARR learns from millions of trial-and-error episodes. It "foresees" that a certain direction, while currently open, might soon lead to a collision with another moving user, and avoids it.
4. Methodology
4.1. Principles
The core principle of MARR is to treat the selection of a reset direction as a decision-making problem. The goal is to maximize the distance walked before the next reset. By using a Multi-Agent architecture, the system ensures that User A's reset doesn't just save User A, but also positions them in a way that doesn't interfere with User B.
4.2. Core Methodology In-depth (Layer by Layer)
The authors formulate the problem as a Markov Decision Process (MDP). We will break this down into the State, Action, and Reward structure, integrated with the network architecture.
The following figure (Figure 2 from the original paper) shows the system architecture. It utilizes the CTDE (Centralized Training, Decentralized Execution) framework. The Critic (center) collects data from all users during training to guide the learning, while each Actor (sides) makes decisions independently during actual use.
Figure 2 (translated caption): A schematic of the Actor-Critic architecture used in multi-agent reinforcement learning. The left and right sides show two agents (Actor 1 and Actor 2) with their respective observations and actions. The central Critic evaluates each agent's actions to help optimize their policies, emphasizing cooperation and dynamic decision-making among agents sharing the same physical space.
4.2.1. State Observation ($s_t$)
When a reset is triggered at time $t$, the system observes the environment. The global state $s_t$ contains the observations of all $n$ users:
$$s_t = (o_t^1, o_t^2, \dots, o_t^n)$$
For the $i$-th user (agent), the individual observation $o_t^i$ consists of their position and orientation relative to the room center:
- Position: The 2D $(x, y)$ coordinates of the user in the physical room, linearly scaled to the range $[-1, 1]$ based on the room size. This scaling allows the model to adapt to different room sizes without retraining.
- Orientation: The user's facing direction (yaw), also scaled to $[-1, 1]$.
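A minimal sketch of how such a normalized per-agent observation could be assembled. The exact ordering, centering convention, and function name are assumptions, not the authors' code.

```python
import numpy as np

def build_observation(pos_xy: np.ndarray, yaw_rad: float, room_half_extent: float) -> np.ndarray:
    """Per-agent observation sketch: 2D position and yaw, each scaled to [-1, 1]
    so one policy can be reused across room sizes (the paper's stated reason for
    the scaling). The room is assumed to be square and centered at the origin."""
    pos_scaled = np.clip(pos_xy / room_half_extent, -1.0, 1.0)
    yaw_wrapped = (yaw_rad + np.pi) % (2.0 * np.pi) - np.pi   # wrap yaw into [-pi, pi)
    return np.array([pos_scaled[0], pos_scaled[1], yaw_wrapped / np.pi], dtype=np.float32)

# Example: a user at (2.0, -3.5) m in a 10 m x 10 m room (half-extent 5 m), yaw 1.2 rad.
obs = build_observation(np.array([2.0, -3.5]), yaw_rad=1.2, room_half_extent=5.0)
```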
4.2.2. Action ($a_t$)
The model must output a specific angle for the user to turn towards. This is the reset direction $\theta_{\text{reset}}$.
The authors define two types of resets (illustrated in Figure 3 below):
- Boundary Reset: Caused by a wall or static obstacle. Let the base direction $\theta_{\text{base}}$ be the normal vector of the wall (perpendicular to it, pointing into the room).
- User Reset: Caused by getting too close to another user. Let the base direction $\theta_{\text{base}}$ be the direction opposite to the user's current facing direction.
Figure 3 (translated caption): A schematic of the MARR reset process. Panel (a) shows the user's current walking direction, the normal direction of the wall or obstacle that triggered the reset, and the reset direction determined by MARR. Panel (b) further shows the relationships among the directions after the reset, including the direction defined as the opposite of the user's facing direction (the user-reset case).
The action is constrained to a valid semicircle to prevent resetting directly back into the obstacle. The valid action range is:
$$\theta_{\text{reset}} \in \left[\theta_{\text{base}} - \frac{\pi}{2},\; \theta_{\text{base}} + \frac{\pi}{2}\right]$$
- Explanation: $\frac{\pi}{2}$ is 90 degrees. The formula simply means the new direction must be within 90 degrees to the left or right of the "safe" base direction (away from the wall, or backwards for a user reset). The neural network predicts a value $a_t \in [-1, 1]$, which is mapped onto this range.
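A minimal sketch of mapping the network output onto this semicircle. The linear mapping and names are assumptions consistent with the description above, not the authors' exact code.

```python
import math

def map_action_to_reset_direction(a: float, theta_base: float) -> float:
    """Map a policy output a in [-1, 1] onto the valid semicircle
    [theta_base - pi/2, theta_base + pi/2]. theta_base is the wall normal for a
    boundary reset, or the opposite of the user's current heading for a user
    reset; a = 0 resets straight along theta_base."""
    a = max(-1.0, min(1.0, a))                            # clamp the network output
    theta = theta_base + a * (math.pi / 2.0)
    return math.atan2(math.sin(theta), math.cos(theta))   # wrap into (-pi, pi]

# Example: network output 0.5 with the wall normal pointing along +x gives ~45 degrees.
reset_dir = map_action_to_reset_direction(0.5, theta_base=0.0)
```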
4.2.3. Reward Function ($r_t$)
This is the most critical part. The reward function tells the AI what "success" looks like. It is a weighted sum of three components:
$$r_t = w_{\text{reset}} \, r_{\text{reset}} + w_{\text{dist}} \, r_{\text{dist}} + w_{\text{area}} \, r_{\text{area}}$$
Let's break down each component and its weight.
1. Reset Penalty ($r_{\text{reset}}$):
- Purpose: This is a fixed negative reward (a punishment). Every time a reset happens, the agent loses points. Because the agent's goal is to maximize total points, it tries to avoid resets (delaying this punishment) as much as possible.
2. Distance Reward ($r_{\text{dist}}$):
- Symbol Definition: $d_{\text{walk}}$ is the distance (in meters) the user manages to walk after this reset until the next reset occurs.
- Purpose: This encourages the agent to choose a direction that leads to a long, uninterrupted walk. To make this fair across room sizes, the reward is scaled by the side length of the physical room: $r_{\text{dist}} = d_{\text{walk}} / L_{\text{room}}$.
3. Area Visibility Reward ($r_{\text{area}}$): This reward encourages facing open space immediately. It is the ratio of the "visible area" in the chosen direction to the visible area over all directions:
$$r_{\text{area}} = \frac{\int_{\theta_{\text{reset}} - \pi/8}^{\theta_{\text{reset}} + \pi/8} \tfrac{1}{2}\, d(\theta)^2 \, d\theta}{\int_{0}^{2\pi} \tfrac{1}{2}\, d(\theta)^2 \, d\theta}$$
- Symbol Explanation:
  - $\theta_{\text{reset}}$: The chosen reset direction.
  - $d(\theta)$: The distance from the user's position to the nearest obstacle (wall or user) in direction $\theta$. This is essentially a "ray-cast" distance.
  - $\tfrac{1}{2}\, d(\theta)^2 \, d\theta$: The area of a small triangular sector in polar coordinates. Integrating this gives the total area of the visible polygon.
  - Numerator range $[\theta_{\text{reset}} - \tfrac{\pi}{8},\; \theta_{\text{reset}} + \tfrac{\pi}{8}]$: A "cone of vision" spanning 45 degrees ($\tfrac{\pi}{4}$), centered on the reset direction.
- Intuition: The numerator is the open area directly in front of the reset direction; the denominator is the total open area around the user. If the user faces a wall, $d(\theta)$ is small within the cone, so the reward is low. If they face the largest open space, the reward is high.
Weights ($w_{\text{reset}}, w_{\text{dist}}, w_{\text{area}}$): Determined via hyperparameter optimization (random search).
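To make the reward computation concrete, here is a minimal discretized sketch that estimates $r_{\text{area}}$ from ray casts and combines the three terms. The ray sampling, function names, and 22.5° cone half-angle are assumptions consistent with the description above; the paper's actual weights and penalty value are not reproduced here.

```python
import numpy as np

def area_visibility_reward(theta_reset: float, ray_angles: np.ndarray,
                           ray_distances: np.ndarray,
                           cone_half_angle: float = np.pi / 8.0) -> float:
    """Discretized r_area: visible area inside the cone around the reset
    direction divided by the visible area over all directions. ray_distances[k]
    is the ray-cast distance to the nearest obstacle along ray_angles[k]; each
    angular sector contributes 0.5 * d^2 * d_theta."""
    d_theta = 2.0 * np.pi / len(ray_angles)                       # uniform angular spacing assumed
    sector_areas = 0.5 * ray_distances ** 2 * d_theta
    ang_diff = np.angle(np.exp(1j * (ray_angles - theta_reset)))  # wrapped angular difference
    in_cone = np.abs(ang_diff) <= cone_half_angle
    return float(sector_areas[in_cone].sum() / sector_areas.sum())

def step_reward(weights, reset_penalty: float, dist_walked_m: float,
                room_side_m: float, r_area: float) -> float:
    """Weighted sum of the three reward components; reset_penalty should be negative."""
    w_reset, w_dist, w_area = weights
    return w_reset * reset_penalty + w_dist * (dist_walked_m / room_side_m) + w_area * r_area

# Example: 72 rays (5-degree resolution) around a user standing in open space.
angles = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
dists = np.full(72, 3.0)
r_area = area_visibility_reward(theta_reset=0.0, ray_angles=angles, ray_distances=dists)
r = step_reward((1.0, 1.0, 1.0), reset_penalty=-1.0,
                dist_walked_m=4.0, room_side_m=10.0, r_area=r_area)
```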
4.2.4. Training Algorithm
The authors use MA-POCA (Multi-Agent Posthumous Credit Assignment).
- Why MA-POCA? In multi-agent scenarios, if one agent dies (or resets) and leaves the episode, standard algorithms struggle to assign credit for future rewards. MA-POCA uses a self-attention mechanism in the Critic to handle a varying number of agents and to properly credit actions even after an agent's episode (path) ends.
- Platform: Unity ML-Agents Toolkit.
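As a schematic illustration of the decentralized-execution half of CTDE: at run time each user's actor only needs its own observation. The actors below are simple placeholders, not the trained MARR networks or the ML-Agents API.

```python
import numpy as np

def decentralized_reset_step(actors, observations):
    """CTDE at execution time: when resets are triggered, each user's trained
    actor picks its action from its own observation only; the centralized
    critic (which saw all agents during training) is no longer needed."""
    return [actor(obs) for actor, obs in zip(actors, observations)]

# Example with two placeholder policies standing in for trained actors.
actors = [lambda o: float(np.tanh(o.sum())), lambda o: 0.0]
actions = decentralized_reset_step(actors, [np.array([0.4, -0.7, 0.1]),
                                            np.array([-0.2, 0.5, 0.9])])
```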
5. Experimental Setup
5.1. Datasets & Environments
Since this is a simulation and VR study, there is no static "dataset" like ImageNet. Instead, data is generated dynamically in simulated physical spaces.
Physical Layouts: The authors designed six distinct room layouts to test robustness (shown in Figure 4 below):
- (a) Simple: A 5x5 m or 10x10 m room with one rectangular obstacle.
- (b) Circle: A room with a central circular obstacle.
- (c) Regular Squares: An auditorium-style room with four pillars.
- (d) Complex: A maze-like room with irregular walls.
- (e/f) More/Less: Variations of "Complex" used to test generalization (unseen during training).
User Paths: Users walked randomly generated paths in a large virtual space, moving from one randomly spawned target to another.
5.2. Evaluation Metrics
- Number of Resets:
  - Definition: The total count of reset events triggered during a fixed simulation session (e.g., completing a path). Lower is better.
  - Formula: A simple count of reset events.
- Mean Distance between Resets (MDbR):
  - Definition: The average distance a user walks in the virtual world between two consecutive reset events. Higher is better.
  - Formula: $\text{MDbR} = \frac{\text{total virtual distance walked}}{\text{number of resets}}$. (A small sketch of this computation follows this list.)
- Questionnaires (User Study):
  - SSQ (Simulator Sickness Questionnaire): Measures physical discomfort (nausea, dizziness).
  - E2I / IPQ / SUSPQ: Measure presence and immersion.
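As referenced above, a minimal sketch of how the MDbR metric could be computed from a session log. The convention for the no-reset case is an assumption.

```python
def mean_distance_between_resets(total_virtual_distance_m: float, num_resets: int) -> float:
    """MDbR: average virtual distance walked between consecutive resets.
    Higher is better. If no reset occurred, the full walked distance is returned."""
    if num_resets == 0:
        return total_virtual_distance_m
    return total_virtual_distance_m / num_resets

# Example: a user walks 240 m in the virtual world and is reset 12 times -> MDbR = 20 m.
assert mean_distance_between_resets(240.0, 12) == 20.0
```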
5.3. Baselines
The paper compares MARR against three standard reset algorithms:
- MR2C (Modified Reset-to-Center): Resets towards the room center, adjusting slightly if the center is blocked.
- R2G (Reset-to-Gradient): Uses Artificial Potential Fields (APF). Resets in the direction where the repulsive forces from walls/obstacles are lowest.
- SFR2G (Step-Forward R2G): Simulates taking several steps in the R2G direction to find a better long-term spot.
These are tested in combination with three steering algorithms: NS (No Steering), S2C (Steer-to-Center), and TAPF (Thomas APF).
6. Results & Analysis
6.1. Simulation Results (E1 - E5)
Experiment 1 (Small Space 5x5m): As shown in Figure 5 below, MARR (orange/red bars) consistently achieves the lowest number of resets across all obstacle types (Simple, Circle, Complex). Note that SFR2G was excluded here because the space was too small for its gradient steps.
Experiment 2 (Medium Space 10x10m): The following figure (Figure 6) shows results for 2 and 3 users.
- Result: MARR significantly outperforms R2G and MR2C.
- Insight: In the "Circle" environment (middle column), heuristic methods like MR2C perform poorly because the center is blocked. MARR learns to avoid the center, drastically reducing resets.
Experiment 3 (Large Space 20x20m, up to 8 users): This stress test (Figure 7 below) is crucial. As the number of users increases (from 2 to 8), the number of resets explodes for traditional methods. MARR maintains a much lower reset count, proving its ability to coordinate crowded spaces.
Experiment 4 & 5 (Generalization): The authors tested MARR on layouts it wasn't trained on (E4) and room sizes it wasn't trained on (E5).
- Result: MARR still outperformed the heuristics, though performance dropped slightly compared to trained environments. Specifically, the model trained in the 10 m room performed better in the 10 m test than the model trained in the 5 m room, validating the importance of matching the training scale.
6.2. User Study Results
A real-world VR study with 28 participants (14 pairs) was conducted.
The following are the results from Figure 11 of the original paper. The left chart shows the Number of Resets (lower is better), and the right chart shows MDbR (Distance walked, higher is better).
Figure 11 (translated caption): Charts showing the number of resets and the MDbR values under different reset techniques for two users with the TAPF steering algorithm. The left side shows the number of resets for MR2C, R2G, and MARR; the right side shows the MDbR results, clearly demonstrating MARR's advantage in the multi-user environment.
- Analysis:
- Resets: MARR reduced resets significantly compared to MR2C and R2G.
- Distance: Users walked significantly farther between interruptions with MARR.
- Subjective Metrics: No significant difference in motion sickness (SSQ) or presence. This is positive, meaning MARR improves efficiency without causing nausea.
6.3. MARL vs. Single-Agent RL (Ablation)
The authors compared MARR (Multi-Agent) against MARR_S (Single-Agent RL using Soft Actor-Critic, SAC).
The following figure (Figure 12) illustrates this comparison.
- MARR_S (Green): Performs similarly to the heuristic SFR2G.
- MARR (Red): Performs significantly better.
- Conclusion: This proves that the Multi-Agent component is vital. Agents need to be "aware" of other agents' policies to cooperate effectively; simply treating others as moving obstacles (the Single-Agent approach) is insufficient.
Figure 12 (translated caption): A chart showing the number of resets for the different reset techniques in the MARL evaluation, conducted in a 10 m x 10 m physical space. The reset techniques include MR2C, R2G, SFR2G, MARR_S, and MARR; the results show that MARR yields the fewest resets.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces MARR, a Multi-Agent Reinforcement Learning framework for optimizing reset directions in Redirected Walking.
- Innovation: It moves beyond static heuristics by learning dynamic, cooperative strategies for multi-user environments.
- Performance: MARR consistently reduces the number of resets and increases walking distance across various room sizes, obstacle layouts, and user densities.
- Versatility: It works effectively with different underlying steering algorithms (S2C, APF).
7.2. Limitations & Future Work
Limitations identified by authors:
- Generalization: While MARR handles slight variations, it performs best when trained on the specific layout it will be used in. A drastically different room might require retraining.
- User Compliance: In the user study, some novices struggled to follow the exact reset rotation instructed by the interface.
- Heterogeneous Agents: The current model assumes all users use the same policy. It hasn't been tested with mixed groups (e.g., one user using MARR, another using R2G).
Future Directions:
- Curriculum Learning: To improve generalization across different room shapes without retraining.
- Position Resets: Currently, MARR only optimizes rotation (Turn in Place). Future work could optimize position (walking to a specific spot for the reset).
7.3. Personal Insights & Critique
- Critique of Method: The integration of the visible-area reward ($r_{\text{area}}$) is a smart heuristic injection into the RL process. Relying solely on the sparse reset penalty ($r_{\text{reset}}$) would likely make convergence very slow; $r_{\text{area}}$ gives the agent immediate geometric feedback.
- Impact: This paper is significant for Location-Based VR (LBVR), such as VR arcades or training warehouses, where maximizing space efficiency for multiple concurrent users is directly tied to revenue and throughput.
- Potential Issue: The requirement for "Centralized Training" implies the system needs global knowledge of all users during training. While "Decentralized Execution" solves this for deployment, the training phase is computationally expensive and requires simulating exact user behaviors, which might be hard to model perfectly for unpredictable humans.