HG-DAgger: Interactive Imitation Learning with Human Experts
TL;DR Summary
HG-DAgger addresses limitations in imitation learning by allowing human experts to take control only when necessary, while incorporating a model-uncertainty-based safety threshold. In autonomous driving tasks, HG-DAgger outperforms DAgger and behavioral cloning in sample efficiency, training stability, and safety.
Abstract
Imitation learning has proven to be useful for many real-world problems, but approaches such as behavioral cloning suffer from data mismatch and compounding error issues. One attempt to address these limitations is the DAgger algorithm, which uses the state distribution induced by the novice to sample corrective actions from the expert. Such sampling schemes, however, require the expert to provide action labels without being fully in control of the system. This can decrease safety and, when using humans as experts, is likely to degrade the quality of the collected labels due to perceived actuator lag. In this work, we propose HG-DAgger, a variant of DAgger that is more suitable for interactive imitation learning from human experts in real-world systems. In addition to training a novice policy, HG-DAgger also learns a safety threshold for a model-uncertainty-based risk metric that can be used to predict the performance of the fully trained novice in different regions of the state space. We evaluate our method on both a simulated and real-world autonomous driving task, and demonstrate improved performance over both DAgger and behavioral cloning.
In-depth Reading
1. Bibliographic Information
1.1. Title
HG-DAgger: Interactive Imitation Learning with Human Experts
1.2. Authors
- Michael Kelly
- Chelsea Sidrane
- Katherine Driggs-Campbell
- Mykel J. Kochenderfer
- Affiliations: Stanford University and SAIC Innovation Center. The authors are researchers in the fields of aeronautics, astronautics, and electrical engineering, focusing on intelligent systems and safety.
1.3. Journal/Conference
The paper is widely cited as appearing at the 2019 IEEE International Conference on Robotics and Automation (ICRA), a premier venue in robotics and automation. The provided source is the arXiv preprint (v2); the arXiv date indicates late 2018.
1.4. Publication Year
Published (UTC): 2018-10-05 (Preprint version).
1.5. Abstract
This paper addresses the limitations of Imitation Learning (IL), specifically the issues of data mismatch and compounding errors in Behavioral Cloning (BC). While the DAgger algorithm mitigates these by mixing expert and novice actions, it is problematic when the expert is a human. DAgger's random switching creates safety hazards and "actuator lag" perception, degrading human performance. The authors propose HG-DAgger (Human-Gated DAgger), a variant where the human expert explicitly gates control, taking over only when the system enters unsafe states. Additionally, the method learns a safety threshold based on model uncertainty (risk) to predict novice performance. Experiments in simulation and on a real autonomous vehicle demonstrate that HG-DAgger outperforms BC and standard DAgger in safety and performance.
1.6. Original Source Link
- Link: https://arxiv.org/abs/1810.02890
- Status: Preprint/Published Paper.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: Autonomous systems (like self-driving cars) need to learn complex behaviors. Imitation Learning (IL), where a robot learns to mimic a human expert, is a popular approach. However, the simplest form, Behavioral Cloning (BC), suffers from "compounding errors"—if the robot makes a small mistake and drifts off the path, it encounters states it never saw during training (data mismatch), leading to failure.
- Existing Gap: The DAgger (Dataset Aggregation) algorithm solves this by iteratively collecting data where the robot (novice) drives and the expert provides corrective labels. However, standard DAgger uses a probabilistic switch (a weighted coin toss governed by a mixing parameter $\beta$) to decide who controls the robot at each timestep.
- Human Factor Challenge: When the expert is a human, this random switching is disorienting. Humans struggle with "partial" control or rapid switching, perceiving it as actuator lag or fighting the controls. This can lead to Pilot-Induced Oscillations, where the human overcorrects, degrading the quality of the training data and compromising safety during the learning process.
2.2. Main Contributions / Findings
- HG-DAgger Algorithm: A novel interactive imitation learning framework designed specifically for human experts. Instead of random switching, the human acts as a "Gate," maintaining full control during recovery maneuvers and leaving the novice alone when safe. This results in higher-quality data and safer training.
- Uncertainty-Based Risk Metric: The method utilizes an ensemble of neural networks to estimate "doubt" (epistemic uncertainty). It proposes a data-driven method to learn a specific safety threshold ($\tau$) for this metric based on human intervention points.
- Empirical Validation: The method was validated in both a high-fidelity simulation and a real-world autonomous vehicle task (lane keeping with obstacles).
- Result: HG-DAgger achieved lower collision and road departure rates compared to BC and standard DAgger.
- Result: The learned risk metric effectively identifies unsafe regions of the state space without needing a ground-truth map.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Imitation Learning (IL): A type of machine learning where an agent learns a policy (mapping from observations to actions) by mimicking demonstrations provided by an expert (e.g., a human driver).
- Behavioral Cloning (BC): The simplest form of IL, treated as a supervised learning problem. The agent is trained to minimize the error between its action and the expert's action for a given state.
- Analogy: Like a student memorizing answers to a test. If the test questions (states) change slightly (drift), the student fails because they don't know how to recover.
- DAgger (Dataset Aggregation): An iterative algorithm.
- Train a policy using available data.
- Run the current policy to collect new trajectories.
- Ask the expert to label what they would have done in these new states.
- Add data to the dataset and retrain.
- Key Innovation: This exposes the robot to its own mistakes during training so it learns how to recover.
- Covariate Shift (Data Mismatch): The difference between the distribution of states encountered by the expert (perfect driving) and the novice (wobbly driving). BC fails because the training data (expert distribution) doesn't match the test data (novice distribution).
- Epistemic Uncertainty (Doubt): Uncertainty arising from a lack of knowledge or data in a certain region. In this paper, it is approximated using an Ensemble of neural networks. If the networks in the ensemble disagree significantly on the action, "doubt" is high, implying the model hasn't seen enough data in that situation.
3.2. Previous Works
- Standard DAgger (Ross et al., 2011): The foundational algorithm. It uses a mixing parameter $\beta$ to stochastically sample actions from either the expert or the novice.
- Limitation cited: Requires the expert to provide labels without being fully in control (whenever $\beta < 1$), or involves rapid control switching, which is unsafe with humans in the loop.
- EnsembleDAgger (Menda et al., 2018): A probabilistic extension that uses an ensemble of networks to measure risk.
- Connection: The current paper builds on this by using the ensemble variance idea but improves the mechanism for selecting the safety threshold and managing human control.
- Human-Centric vs. Robot-Centric Sampling:
- Robot-Centric (RC): The robot drives, expert corrects. Good for robust data but dangerous during training.
- Human-Centric: Human drives. Safe, but suffers from data mismatch.
- HG-DAgger: Bridges this gap by allowing the robot to drive (RC) but giving the human absolute veto power (Safety).
3.3. Differentiation Analysis
- vs. DAgger: DAgger uses a mathematical schedule (the decaying mixing parameter $\beta$) to switch control. HG-DAgger gives the human the switch. This acknowledges that humans are not just "labeling functions" but dynamic controllers who need consistent feedback.
- vs. Confidence-Based Autonomy (Chernova & Veloso, 2009): That method allows the novice to ask for help when its confidence is low. HG-DAgger allows the expert to intervene when they perceive danger. This is crucial because a novice might be "confidently wrong" (low uncertainty but high error), whereas a human expert can spot actual danger.
4. Methodology
4.1. Principles
The core philosophy of HG-DAgger is to harmonize the mathematical needs of imitation learning (visiting states induced by the novice to correct errors) with the physiological and psychological needs of a human expert (need for consistent control authority to avoid oscillation and stress).
Instead of stochastic mixing, the system operates in two binary modes:
- Novice Control: The robot drives. The human monitors.
- Expert Control: The human drives. This happens only when the human deems the situation "unsafe."

Data is collected only during the transition and recovery phases performed by the expert, ensuring that the labels provided are high-quality recovery actions.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. The Gating Mechanism
The system is modeled as a switched system controlled by a "gating function" operated by the human.
Let $\pi_H$ be the human expert policy and $\pi_{N_i}$ be the novice policy at training epoch $i$. The human implicitly defines a set of "permitted" (safe) states, denoted $\mathcal{S}_{\text{safe}}$. The gating function $g$ determines who is in control at state $s_t$:

$$g(s_t) = \mathbb{1}\left[s_t \notin \mathcal{S}_{\text{safe}}\right]$$

- Symbol Explanation:
  - $s_t$: The full state of the system at time $t$.
  - $\mathcal{S}_{\text{safe}}$: The set of safe states.
  - $\mathbb{1}[\cdot]$: An indicator function. It equals 1 if the condition is true (the state is not in the safe set, i.e., unsafe), and 0 otherwise.
- Meaning: If the state is unsafe ($s_t \notin \mathcal{S}_{\text{safe}}$), the gate is 1 (expert acts). If safe, the gate is 0 (novice acts).
4.2.2. The Rollout Policy
Based on the gating function, the actual policy executed during the data-gathering rollout is defined as:

$$\pi(s_t, o_t) = g(s_t)\,\pi_H(s_t) + \left(1 - g(s_t)\right)\pi_{N_i}(o_t)$$

- Symbol Explanation:
  - $\pi(s_t, o_t)$: The action applied to the environment at time $t$.
  - $\pi_H(s_t)$: The expert's action (uses the full state $s_t$).
  - $\pi_{N_i}(o_t)$: The novice's action (uses the observation $o_t$).
  - Integration: The novice relies on sensor observations ($o_t$), while the human expert perceives the full state ($s_t$). Because $g(s_t)$ is binary, the formula mathematically enforces that only one agent controls the actuators at any instant, preventing the "fighting" over controls seen in standard DAgger.
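As a concrete illustration, the gating function and rollout policy can be sketched in a few lines of Python. All names here are hypothetical, and the toy "safe set" predicate stands in for the human's judgment; the paper does not specify an implementation:

```python
def gate(state, is_safe):
    """Gating function g(s_t): 1 when the state is judged unsafe
    (outside the permitted set), 0 otherwise. The `is_safe` predicate
    stands in for the human's implicit safe set."""
    return 0 if is_safe(state) else 1

def rollout_policy(state, obs, expert, novice, is_safe):
    """pi(s_t, o_t) = g(s_t) * pi_H(s_t) + (1 - g(s_t)) * pi_N(o_t):
    exactly one agent commands the actuators at any instant."""
    return expert(state) if gate(state, is_safe) == 1 else novice(obs)

# Toy example: the state is the lateral offset from the lane center
# (meters); "safe" means |offset| < 1.0. Both policies output steering.
is_safe = lambda s: abs(s) < 1.0
expert = lambda s: -0.5 * s   # expert steers firmly back toward center
novice = lambda o: -0.1 * o   # novice steers weakly

print(rollout_policy(0.5, 0.5, expert, novice, is_safe))  # novice in control
print(rollout_policy(1.5, 1.5, expert, novice, is_safe))  # expert takes over
```

Because the gate is a hard binary rather than a probability, the expert either has full authority or none, which is exactly the human-factors property the method is after.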
The following figure (Figure 1 from the original paper) illustrates this control loop:
This figure is a schematic of the HG-DAgger control loop. It shows the human expert policy H and the novice policy N interacting through a gating function; the computed action is sent to the environment, which receives the state and outputs an observation, illustrating the human-robot collaboration structure.
4.2.3. Data Collection Strategy
A critical distinction in HG-DAgger is when data is recorded. To avoid "polluting" the dataset with novice errors or passive human behavior, labels are collected specifically during human interventions.
Let $\xi_i$ be the trajectory (sequence of states) in epoch $i$. The dataset collected in this epoch, $D_i$, is:

$$D_i = \left\{ \left(o_t, \pi_H(s_t)\right) : s_t \in \xi_i,\; g(s_t) = 1 \right\}$$

- Symbol Explanation:
  - $o_t$: The observation recorded (input to the neural net).
  - $\pi_H(s_t)$: The expert's action (the label).
  - $g(s_t) = 1$: The condition that data is only added when the expert is effectively in control (unsafe/recovery states).
- Intuition: The novice drives the car off the road (safe → unsafe). The human grabs the wheel ($g = 1$). The system records the observations and the human's steering corrections. Once back on track ($g = 0$), recording stops. This teaches the novice specifically how to recover from mistakes.
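The intervention-gated recording rule can be sketched as follows. This is a self-contained toy; the safe-set predicate and expert policy are hypothetical stand-ins:

```python
def collect_epoch_data(trajectory, expert, is_safe):
    """Record (observation, expert action) pairs only on timesteps where
    the expert is in control (g(s_t) = 1), i.e., during interventions
    and recoveries. trajectory is a list of (state, observation) pairs."""
    dataset = []
    for state, obs in trajectory:
        if not is_safe(state):                  # g(s_t) = 1 -> expert gating
            dataset.append((obs, expert(state)))
    return dataset

is_safe = lambda s: abs(s) < 1.0
expert = lambda s: -0.5 * s
# The novice drifts out (|s| >= 1.0) for two timesteps mid-trajectory;
# only those steps produce labels.
traj = [(0.2, 0.2), (0.8, 0.8), (1.3, 1.3), (1.1, 1.1), (0.6, 0.6)]
print(collect_epoch_data(traj, expert, is_safe))
```

Note that the safe timesteps contribute nothing: the dataset consists purely of recovery behavior, which is the point of the sampling scheme.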
4.2.4. Risk Metric (Doubt)
The paper proposes measuring the "doubt" of the novice to estimate risk. The novice is trained not as a single network, but as an ensemble of neural networks.
Let $\Sigma(o_t)$ be the covariance matrix of the outputs of the ensemble members given input $o_t$. The doubt is defined as the $\ell_2$-norm of the diagonal of this matrix (the variances):

$$\text{doubt}(o_t) = \left\lVert \operatorname{diag}\left(\Sigma(o_t)\right) \right\rVert_2$$

- Symbol Explanation:
  - $\Sigma(o_t)$: Covariance matrix of the ensemble's action predictions.
  - $\operatorname{diag}(\Sigma(o_t))$: The vector containing the variances of each action dimension (e.g., variance in steering, variance in acceleration).
  - $\lVert \cdot \rVert_2$: The Euclidean norm (square root of the sum of squares).
- Concept: If the networks in the ensemble predict very different actions for the same input, the variances are high, "doubt" is high, and the situation is likely novel or risky.
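The doubt metric is straightforward to compute from a batch of ensemble predictions; a minimal NumPy sketch (the array shapes are an assumption about how one would organize the predictions):

```python
import numpy as np

def doubt(ensemble_actions):
    """doubt(o_t) = || diag(Sigma(o_t)) ||_2: the Euclidean norm of the
    per-dimension variances of the ensemble's action predictions.
    ensemble_actions has shape (n_models, action_dim) for one observation."""
    variances = np.var(ensemble_actions, axis=0)  # diagonal of Sigma(o_t)
    return float(np.linalg.norm(variances))       # l2 norm of the variances

# An ensemble that agrees has low doubt; one that disagrees has high doubt.
agree = np.array([[0.10, 0.00], [0.11, 0.00], [0.09, 0.00]])
disagree = np.array([[0.90, 0.00], [-0.80, 0.10], [0.10, -0.50]])
print(doubt(agree) < doubt(disagree))  # True
```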
4.2.5. Learning the Safety Threshold ($\tau$)
A key contribution is automating the definition of "safe" vs. "unsafe" using this doubt metric. The system learns a threshold $\tau$ purely from observing when the human chooses to intervene.
When the expert takes control (intervenes), the novice's doubt at that instant is recorded in a logfile $I$. The threshold is computed as the average doubt of the interventions in the final quartile of the training process:

$$\tau = \frac{1}{k - j + 1} \sum_{t=j}^{k} I[t], \qquad j = \left\lceil \tfrac{3k}{4} \right\rceil$$

- Symbol Explanation:
  - $I$: The list of recorded doubt values at the exact moments of human intervention.
  - $k$: The total number of interventions recorded.
  - $j$: The index marking the start of the last 25% of interventions.
- Logic: The authors use only the later interventions because, in early training, the novice is poor everywhere, so interventions happen at both easy and hard states. In later epochs, the novice is competent, so interventions happen only at truly difficult or ambiguous states (the "boundary" of the safe set).
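Under the quartile convention described above, the threshold computation might look like the following sketch (the exact indexing convention used in the paper may differ slightly):

```python
import math

def learn_threshold(intervention_doubts):
    """tau = mean doubt over the final quartile of recorded interventions.
    Early interventions are discarded because an immature novice triggers
    takeovers everywhere, not just near the true safety boundary."""
    k = len(intervention_doubts)
    j = math.ceil(3 * k / 4)          # index where the last 25% begins
    tail = intervention_doubts[j:]
    return sum(tail) / len(tail)

# Toy log: early intervention doubts are scattered; later ones cluster
# near the boundary of the safe set.
log = [0.90, 0.10, 0.70, 0.20, 0.35, 0.32, 0.30, 0.33]
print(learn_threshold(log))  # mean of the last quartile [0.30, 0.33]
```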
4.2.6. Algorithmic Flow Summary
The complete HG-DAgger process is:
- Initialize: Train an initial novice via Behavioral Cloning on offline data $D_0$.
- Loop (Epochs):
  1. Rollout: Let the novice drive.
  2. Monitor: The human watches; if the situation becomes unsafe, the human drives ($g = 1$).
  3. Record: During human driving, save $(o_t, \pi_H(s_t))$ to $D_i$ and log the novice's doubt to $I$.
  4. Aggregate: $D \leftarrow D \cup D_i$.
  5. Train: Train a new novice ensemble on $D$.
  6. Update Threshold: Recompute $\tau$ from the intervention log.
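The loop above can be sketched end-to-end. Everything here is a toy instantiation: the trainer, rollout, and policies are stand-ins, not the paper's implementation:

```python
import math

def hg_dagger(train, rollout, epochs, init_data):
    """Skeleton of the HG-DAgger loop. `train` maps a dataset to a policy;
    `rollout` runs the novice with human gating and returns the expert
    labels gathered during interventions plus the logged doubt values."""
    dataset = list(init_data)
    doubt_log = []
    policy = train(dataset)                 # init via behavioral cloning
    for _ in range(epochs):
        labels, doubts = rollout(policy)    # human gates interventions
        dataset += labels                   # aggregate: D <- D U D_i
        doubt_log += doubts
        policy = train(dataset)             # retrain the novice ensemble
    j = math.ceil(3 * len(doubt_log) / 4)   # threshold from last quartile
    tau = sum(doubt_log[j:]) / len(doubt_log[j:])
    return policy, tau

# Toy instantiation: the "policy" is just the mean action label seen so far,
# and every epoch the expert "intervenes" twice with label 1.0.
train = lambda data: sum(a for _, a in data) / len(data)
rollout = lambda policy: ([("obs", 1.0), ("obs", 1.0)], [0.4, 0.3])

policy, tau = hg_dagger(train, rollout, epochs=3, init_data=[("obs", 0.0)])
print(policy, tau)
```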
5. Experimental Setup
5.1. Task & Environment
The experiments focus on autonomous driving.
- Task: Navigate a two-lane, one-way road populated with stationary obstacle cars. The ego vehicle must weave between obstacles without colliding or leaving the road.
- Simulation: A bicycle model simulation.
- Real World: An MG-GS vehicle equipped with LiDAR and high-fidelity localization.
- Hybrid Setup: The "obstacles" in the real-world test were virtual (simulated), but the ego vehicle was real and moving on a real track. A safety driver sat in the car, while the "expert" driver controlled the car via a remote interface with a steering wheel and pedals (see Figure 2).
The following figure (Figure 2 from the paper) shows the real-world test vehicle and the interface used by the expert:
This figure contains two images: the left shows a sensor-equipped autonomous vehicle driving in the test environment, and the right shows the human driver's control setup for operating the actual vehicle. The images illustrate the application of the HG-DAgger method to the autonomous driving task.
5.2. Datasets
- Initialization Data: 10,000 expert action labels collected while the expert drove normally, used to train the initial Behavioral Cloning policy.
- Augmentation Data: For each method (HG-DAgger, DAgger), an additional 5 epochs were run, collecting 2,000 labels per epoch.
- Input Data ($o_t$):
  - Distance from the median
  - Orientation (heading)
  - Speed
  - Distances to the lane edges
  - Distances to the nearest leading obstacles
5.3. Evaluation Metrics
- Collision Rate:
- Definition: The frequency of the vehicle hitting an obstacle per meter traveled.
- Formula (Implied): $\text{Collision Rate} = \dfrac{\#\,\text{collisions}}{\text{meters traveled}}$.
- Road Departure Rate:
- Definition: The frequency of the vehicle's center of mass leaving the defined road boundaries per meter.
- Bhattacharyya Distance:
- Definition: A statistical measure used to quantify the similarity between two probability distributions. Here, it measures how "human-like" the novice's steering angle distribution is compared to the expert's distribution. Lower is better (more similar).
- Formula:

$$D_B(p, q) = -\ln\left( \sum_{x} \sqrt{p(x)\, q(x)} \right)$$

- Symbol Explanation:
  - $p(x)$ and $q(x)$ are the probabilities of steering angle $x$ under the novice and expert distributions, respectively.
- Doubt Threshold Accuracy: Evaluated by checking if the learned threshold correctly separates free space (safe) from occupied space (unsafe) on a risk map.
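The two rate metrics and the Bhattacharyya distance are simple to compute. A sketch follows; treating the steering distributions as normalized histograms is an assumption, since the paper does not detail its discretization:

```python
import numpy as np

def event_rate(n_events, meters_traveled):
    """Collisions (or road departures) per meter traveled."""
    return n_events / meters_traveled

def bhattacharyya_distance(p, q):
    """D_B(p, q) = -ln( sum_x sqrt(p(x) q(x)) ) for two discrete
    distributions, e.g. normalized histograms of steering angles.
    Lower means the novice's distribution is closer to the expert's."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(-np.log(np.sum(np.sqrt(p * q))))

print(event_rate(2, 1000.0))                          # 2 events over 1 km
print(bhattacharyya_distance([1, 2, 3], [1, 2, 3]))   # identical -> ~0.0
```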
5.4. Baselines
- Behavioral Cloning (BC): Standard supervised learning on expert data. Serves as the lower bound / starting point.
- DAgger: The standard algorithm with probabilistic mixing parameter $\beta$, which started at 0.85 and decayed by a factor of 0.85 per epoch. This represents the state-of-the-art in interactive IL but suffers from the human-factors issues discussed.
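For contrast with HG-DAgger's human-held gate, the baseline's stochastic mixing can be sketched as follows. The schedule values follow the paper; the sampling code itself is an illustrative assumption:

```python
import random

def beta_schedule(epoch, beta0=0.85, decay=0.85):
    """DAgger baseline schedule: beta starts at 0.85 and decays by a
    factor of 0.85 each epoch."""
    return beta0 * decay ** epoch

def dagger_mixed_action(expert_action, novice_action, beta, rng):
    """Per-timestep probabilistic switch: the expert acts with probability
    beta, the novice otherwise. This per-timestep coin toss is what human
    experts perceive as actuator lag."""
    return expert_action if rng.random() < beta else novice_action

print([round(beta_schedule(i), 4) for i in range(3)])  # [0.85, 0.7225, 0.6141]
rng = random.Random(0)
actions = [dagger_mixed_action("expert", "novice", beta_schedule(0), rng)
           for _ in range(1000)]
print(actions.count("expert") / 1000)  # close to 0.85
```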
6. Results & Analysis
6.1. Simulation Results (Learning Performance)
HG-DAgger demonstrated superior learning stability and efficiency.
- Road Departure: As shown in Figure 3, HG-DAgger (green) rapidly reduces the departure rate to near zero. Standard DAgger (blue) shows instability: its error rate actually increases in later epochs.
- Analysis: The authors hypothesize this instability in DAgger comes from "perceived actuator lag" degrading the human expert's ability to provide consistent labels when control is stochastically switched.
The following figure (Figure 3 from the original paper) shows the road departure rates:
This figure is a chart showing the mean road departure rate for the three algorithms (Behavioral Cloning, DAgger, HG-DAgger) as a function of the number of expert labels. The x-axis is the number of expert labels, the y-axis is the mean road departure rate, and the error bars denote the standard deviation.
- Collision Rate: Figure 4 shows a similar trend. HG-DAgger converges to a negligible collision rate, while DAgger fluctuates.
The following figure (Figure 4 from the original paper) shows the collision rates:
This figure is a chart showing how the mean collision rate varies with the number of expert labels over training. Error bars denote the standard deviation; the chart compares Behavioral Cloning, DAgger, and HG-DAgger.
6.2. Safety & Risk Metric Validation
To prove the "doubt" metric and threshold are meaningful, the authors tested the novice by initializing it in two different sets of states:
- Inside the estimated permissible set: states where doubt $< \tau$.
- Outside the estimated permissible set: states where doubt $\geq \tau$ (but still physically recoverable).
The results are presented in Table I.
The following are the results from Table I of the original paper, comparing performance when initialized inside versus outside the estimated permissible set:

| Initialization | Collision Rate | Road Departure Rate | Departure Duration |
|---|---|---|---|
| Within the permissible set | — | — | 1.630 s |
| Outside the permissible set | — | — | 3.740 s |
- Analysis: The collision rate is over 12x higher and the road departure rate 20x higher when initialized in "high doubt" regions. This strongly validates that the learned threshold $\tau$ correctly identifies dangerous states.
6.3. Real-World Vehicle Results
The real-world tests (though limited in sample size) corroborated the simulation findings.
The following are the results from Table II of the original paper, summarizing the on-vehicle test data:

| Method | # Collisions | Collision Rate | # Road Departures | Road Departure Rate | Bhattacharyya Metric |
|---|---|---|---|---|---|
| Behavioral Cloning | 1 | — | 6 | — | 0.1173 |
| DAgger | 1 | — | 1 | — | 0.1057 |
| HG-DAgger | 0 | 0.0 | 0 | 0.0 | 0.0834 |
- Key Finding: HG-DAgger had zero failures in this test set.
- Human-Likeness: The Bhattacharyya metric (0.0834) is lowest for HG-DAgger, indicating its steering behavior was most similar to the human expert.
Trajectory Visualization: Figure 5 visually demonstrates this. The HG-DAgger path (bottom) is smooth and stays in the lane. BC (top) drives off the road. DAgger (middle) weaves significantly.
This figure is a chart comparing the trajectories on the vehicle test data for Behavioral Cloning, DAgger, HG-DAgger, and the human driver. Each method's trajectory is drawn as a pink line, with distance (meters) on the x-axis, allowing a direct visual comparison of the methods.
6.4. Risk Map Analysis
The authors generated a "Risk Map" by computing the novice's doubt at every point in the workspace.
- Visual Analysis: As seen in Figure 6, the map generated using the learned threshold (Center) perfectly highlights the obstacles (white boxes) and road edges as "unsafe" (red), while the lane centers are "safe" (blue).
- Using a larger threshold (right panel) misses obstacles (false negatives).
- Using a smaller threshold (left panel) marks nearly everything as unsafe (false positives).

This figure shows risk maps illustrating the safety of the trained policy under different doubt thresholds; the left, center, and right panels show the risk distribution for a smaller threshold, the learned threshold $\tau$, and a larger threshold, respectively. The purple box represents the ego vehicle and the white boxes represent other vehicles. Blue regions are safe, red regions unsafe. In all panels, the x-axis is the distance to the median and the y-axis is the distance along the road.
Figure 7 quantifies this pixel-wise classification accuracy, showing the learned threshold (dashed line) is near-optimal for maximizing F1-score.
This figure is a chart showing performance on the pixel-wise free-space vs. occupied-space classification task across different doubt thresholds. It plots the micro-averaged F1 score, mean F1 score, balanced accuracy, free-space F1 score, and occupied-space F1 score; the dashed line marks the learned doubt threshold.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces HG-DAgger, a practical algorithm for interactive imitation learning with human experts. By shifting the "gating" authority from a random coin toss (as in DAgger) to the human expert, the method:
- Improves training safety by ensuring an expert is in control during dangerous maneuvers.
- Improves data quality by eliminating human confusion caused by actuator lag.
- Enables the learning of a calibrated risk metric (the doubt threshold $\tau$) that accurately predicts performance drops in the novice policy. The results on a real autonomous vehicle confirm that this approach yields safer, more human-like driving policies compared to the baselines.
7.2. Limitations & Future Work
- Limitation: The method still relies on the human to be vigilant. If the human fails to intervene in time, the system can still crash during training.
- Limitation: The "doubt" metric relies on ensemble variance. While effective, ensembles increase computational cost during inference compared to a single network.
- Future Work: The authors propose using the learned risk metric at test time to automatically switch control to a conservative fallback controller (safe stop) if the novice becomes confused, effectively automating the gating function for deployment.
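The proposed deployment-time use of the learned threshold could look like the following hypothetical sketch; the paper proposes but does not implement this fallback mechanism, and all names here are illustrative:

```python
def deploy_action(obs, novice, fallback, doubt_fn, tau):
    """If the novice's doubt exceeds the learned threshold tau, hand
    control to a conservative fallback (e.g., a safe stop); otherwise
    let the novice act. This automates the human's gating role."""
    return fallback(obs) if doubt_fn(obs) > tau else novice(obs)

novice = lambda o: 0.3        # hypothetical steering command
fallback = lambda o: 0.0      # neutral / safe-stop command
doubt_fn = lambda o: abs(o)   # stand-in doubt: grows with lateral offset

print(deploy_action(0.2, novice, fallback, doubt_fn, tau=0.5))  # novice acts
print(deploy_action(0.9, novice, fallback, doubt_fn, tau=0.5))  # fallback
```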
7.3. Personal Insights & Critique
- Human-Centric Design: This paper is an excellent example of "Human-Robot Interaction (HRI) meets Machine Learning." It recognizes that the source of the data (the human) is a biological system with cognitive limits (reaction time, confusion). Purely mathematical algorithms (like standard DAgger) often fail in the real world because they ignore these human factors.
- Practicality of "Doubt": The idea of learning the threshold from interventions is very clever. It turns the implicit knowledge of the expert ("I feel I need to take over now") into an explicit mathematical boundary. This is highly transferable to other domains like robotic surgery or drone piloting.
- Critique: The paper assumes the "safe set" is static. In dynamic environments with moving pedestrians, the "doubt" might need to capture temporal dynamics more explicitly, which could be challenging for simple ensembles. Additionally, the reliance on an ensemble implies a computational cost that grows linearly with the number of models, which might be a bottleneck for high-frequency control loops on embedded hardware.