Paper status: completed

Stable-Predictive Optimistic Counterfactual Regret Minimization

Published:02/14/2019
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a new CFR variant that achieves an $O(T^{-3/4})$ convergence rate for large-scale extensive-form games. By combining advances in predictive and stable regret minimization into a new notion of 'stable-predictivity', it improves on the convergence guarantee of traditional CFR.

Abstract

The CFR framework has been a powerful tool for solving large-scale extensive-form games in practice. However, the theoretical rate at which past CFR-based algorithms converge to the Nash equilibrium is on the order of $O(T^{-1/2})$, where $T$ is the number of iterations. In contrast, first-order methods can be used to achieve a $O(T^{-1})$ dependence on iterations, yet these methods have been less successful in practice. In this work we present the first CFR variant that breaks the square-root dependence on iterations. By combining and extending recent advances on predictive and stable regret minimizers for the matrix-game setting we show that it is possible to leverage "optimistic" regret minimizers to achieve a $O(T^{-3/4})$ convergence rate within CFR. This is achieved by introducing a new notion of stable-predictivity, and by setting the stability of each counterfactual regret minimizer relative to its location in the decision tree. Experiments show that this method is faster than the original CFR algorithm, although not as fast as newer variants, in spite of their worst-case $O(T^{-1/2})$ dependence on iterations.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is Stable-Predictive Optimistic Counterfactual Regret Minimization. It introduces a new variant of the Counterfactual Regret Minimization (CFR) framework designed to achieve a faster theoretical convergence rate to Nash equilibrium in extensive-form games.

1.2. Authors

  • Gabriele Farina: Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.

  • Christian Kroer: IEOR Department, Columbia University, New York, NY.

  • Noam Brown: Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.

  • Tuomas Sandholm: Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.

    The authors are affiliated with prominent computer science and operations research departments at leading universities, suggesting expertise in game theory, artificial intelligence, and optimization.

1.3. Journal/Conference

This paper was published as an Arxiv Preprint. While Arxiv is a reputable platform for sharing pre-prints of scientific research, it is not a peer-reviewed journal or conference. However, the authors (especially Noam Brown and Tuomas Sandholm) are highly influential researchers in the field of AI for imperfect-information games, known for their work on Libratus and Pluribus. Many of their works initially appear on Arxiv before formal publication in top-tier conferences or journals like NIPS, AAAI, or Science.

1.4. Publication Year

February 14, 2019 (Arxiv preprint date: 2019-02-13T16:18:21.000Z).

1.5. Abstract

The paper addresses the theoretical convergence rate limitation of traditional Counterfactual Regret Minimization (CFR) algorithms, which typically converge to a Nash equilibrium at a rate of $O(T^{-1/2})$ (where $T$ is the number of iterations). This rate is slower than the $O(T^{-1})$ achievable by some first-order methods (FOMs), though FOMs have been less successful in practice for extensive-form games (EFGs). This work introduces the first CFR variant that surpasses the square-root dependence, achieving a $O(T^{-3/4})$ convergence rate. It accomplishes this by integrating and extending concepts from predictive and stable regret minimizers (specifically optimistic regret minimizers) developed for matrix games. The core innovation involves a new notion of stable-predictivity and dynamically setting the stability of each counterfactual regret minimizer relative to its position within the decision tree. Experimental evaluation on Heads-Up No-Limit Texas Hold'em (HUNL) subgames shows that the proposed method is faster than vanilla CFR, but it does not outperform newer, state-of-the-art CFR variants (like Discounted CFR) in practice, despite their theoretical worst-case $O(T^{-1/2})$ rate.

Official Source Link: https://arxiv.org/abs/1902.04982v1 PDF Link: https://arxiv.org/pdf/1902.04982v1.pdf Publication Status: Arxiv Preprint.

2. Executive Summary

2.1. Background & Motivation

The paper is motivated by a significant theoretical-practical disconnect in solving large-scale extensive-form games (EFGs), particularly in domains like poker.

  • Core Problem: The Counterfactual Regret Minimization (CFR) framework has been the practical state-of-the-art for finding approximate Nash equilibria in large EFGs for over a decade. It has powered AI breakthroughs in poker, such as Libratus and DeepStack. However, its theoretical worst-case convergence rate is $O(T^{-1/2})$, where $T$ is the number of iterations. This means the accuracy of the solution improves proportionally to the inverse square root of time.
  • Importance and Challenges: Achieving faster convergence is critical for solving increasingly complex games, reducing computation time, and enabling more sophisticated AI agents. The $O(T^{-1/2})$ rate is a theoretical bottleneck compared to some first-order methods (FOMs) that can achieve $O(T^{-1})$ (i.e., a linear dependence on iterations, which is much faster). The challenge is that these theoretically faster FOMs have historically performed worse than CFR variants in practical settings, despite their asymptotic advantage. This creates a gap: highly practical CFR algorithms are theoretically slow, while theoretically fast FOMs are practically slow in this domain.
  • Paper's Entry Point & Innovative Idea: The paper's innovative idea is to bridge this gap by improving the theoretical convergence rate within the CFR framework. It aims to achieve this by incorporating optimistic regret minimization techniques, which have previously shown promise in simpler matrix game settings for accelerating convergence. The key insight is to leverage predictive and stable regret minimizers at each local information set within the CFR algorithm, adapting their stability based on their position in the game tree.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • First CFR Variant to Break the Square-Root Barrier: This work presents the first CFR algorithm that theoretically converges faster than $O(T^{-1/2})$, achieving an $O(T^{-3/4})$ rate. This is a crucial step towards reconciling the theoretical convergence rates of CFR and first-order methods.
  • Leveraging Optimistic Regret Minimizers within CFR: The paper demonstrates how to effectively integrate optimistic regret minimizers (specifically, Optimistic Follow-the-Regularized-Leader (OFTRL)) into the hierarchical structure of CFR. This involves setting up optimistic counterfactual regret minimizers at each information set while preserving the necessary properties for acceleration.
  • Introducing Stable-Predictivity: A new notion of stable-predictivity is introduced for regret minimizers. This concept combines a bound on the regret based on prediction quality with a condition that decisions change slowly over iterations (stability).
  • Adaptive Stability for Counterfactual Regret Minimizers: The authors show that the stability of each local counterfactual regret minimizer must be set relative to its location in the decision tree, with deeper minimizers requiring more stability. This hierarchical adjustment of stability parameters is crucial for the overall convergence guarantee.
  • Experimental Validation: Experiments on Heads-Up No-Limit Texas Hold'em (HUNL) subgames confirm that the proposed Stable-Predictive Optimistic CFR (OFTRL) algorithm is faster than vanilla CFR. However, it does not surpass the empirical performance of newer CFR variants like Discounted CFR (DCFR), which, despite having the same worst-case theoretical rate as vanilla CFR ($O(T^{-1/2})$), performs exceptionally well in practice for many games. This highlights a persistent gap between theoretical worst-case guarantees and empirical average-case performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of game theory, online learning, and extensive-form games is essential.

  • Extensive-Form Games (EFGs):

    • Concept: An extensive-form game (EFG) is a model of a game where players make decisions sequentially, and information about past actions or states may be imperfect. It's represented as a tree structure.
    • Decision Nodes ($\mathcal{I}$): These are points in the game tree where a player must make a choice (e.g., check, bet, fold in poker).
    • Observation Nodes ($\mathcal{K}$): These represent points where a player might receive information or a signal from the game (e.g., seeing community cards or an opponent's action).
    • Actions ($A_j$): The set of choices available at a decision node $j$.
    • Strategies: A complete plan of action for a player, specifying what action to take at every decision node given any possible sequence of observations.
    • Perfect Recall: A property of players in EFGs meaning they never forget their past moves or observations. This simplifies strategy representation and analysis.
    • Zero-Sum Game: A game where the total gains and losses of all players sum to zero. One player's gain is another's loss, making it a purely competitive scenario.
  • Nash Equilibrium:

    • Concept: A Nash equilibrium is a stable state in a game where no player can gain by unilaterally changing their strategy, assuming the other players' strategies remain unchanged. In zero-sum games, computing a Nash equilibrium is equivalent to finding a pair of strategies $(x^*, y^*)$ such that player 1 cannot improve their expected payoff by changing $x^*$ given $y^*$, and player 2 cannot improve their expected payoff by changing $y^*$ given $x^*$.
    • Relevance: Finding a Nash equilibrium is the primary objective in solving zero-sum games, as it represents optimal play for rational agents.
  • Counterfactual Regret Minimization (CFR):

    • Concept: CFR is an iterative algorithm designed to find approximate Nash equilibria in large-scale extensive-form games. It operates by iteratively traversing the game tree and computing regret for each possible action at each information set.
    • Information Set: In games with imperfect information, an information set is a set of decision nodes that a player cannot distinguish between. When it's a player's turn to act, they know they are in one of the nodes within an information set, but they don't know exactly which one. CFR computes regret for actions within information sets.
    • Regret Matching: A common strategy update rule in CFR. If an action's cumulative regret (total loss suffered from not taking the best action in hindsight) is positive, the probability of choosing that action is increased. If it's negative or zero, the probability might be reduced or stay the same.
    • Counterfactual Regret: A localized notion of regret specific to CFR. For a particular information set, it measures the difference between the payoff obtained by following the current strategy and the payoff that would have been obtained if the player had consistently chosen a specific alternative action only at that information set, assuming all other actions throughout the rest of the game tree remain the same. This allows the problem of overall regret minimization to be decomposed into local regret minimization problems.
  • Online Convex Optimization (OCO):

    • Concept: OCO is a framework for sequential decision-making where a decision-maker chooses an action (decision vector) at each time step, and then observes a loss function (loss vector). The goal is to minimize cumulative regret.
    • Decisions ($x^t$): The strategy chosen by the agent at iteration $t$.
    • Loss Vectors ($\ell^t$): The cost or negative reward associated with an action, revealed after the decision is made.
    • Cumulative Regret ($R^T$): The difference between the total loss incurred by the agent's sequence of decisions and the total loss of the best fixed decision in hindsight over $T$ iterations. Mathematically, $$R^T := \sum_{t=1}^T \langle \ell^t, x^t \rangle - \operatorname*{min}_{\tilde{x} \in \mathcal{X}} \left\{ \sum_{t=1}^T \langle \ell^t, \tilde{x} \rangle \right\},$$ where $\mathcal{X}$ is the set of feasible decisions, $\ell^t$ is the loss vector at time $t$, $x^t$ is the decision made at time $t$, and $\tilde{x}$ is the best fixed decision in hindsight. (A minimal code sketch of this bookkeeping appears after this list.)
  • Predictive (Optimistic) Regret Minimization:

    • Concept: An extension of OCO where the decision-maker has access to a prediction $m^t$ of the upcoming loss vector $\ell^t$ before making the decision $x^t$. The goal is to leverage these predictions to achieve lower regret, especially if the predictions are accurate. This is also often called optimistic learning.
  • Sequence Form:

    • Concept: An alternative, linear representation of strategies in extensive-form games with perfect recall. Instead of representing strategies as probabilities at each decision node, the sequence form represents the probability of reaching and executing a sequence of actions from the root of the game tree.
    • Benefit: Crucially, in sequence form, the expected payoff of an EFG becomes a linear function of the strategy variables, even though it's typically non-linear in behavioral strategies (probabilities at individual nodes). This linearity makes it amenable to convex optimization techniques.
    • Bilinear Saddle-Point Problem (BSPP): The problem of finding a Nash equilibrium in a two-player zero-sum EFG can be formulated as a BSPP in sequence form: $$\operatorname*{min}_{x \in \mathcal{X}} \operatorname*{max}_{y \in \mathcal{Y}} x^\top A y,$$ where $\mathcal{X}$ and $\mathcal{Y}$ are the convex and compact sequence-form strategy spaces for the two players, and $A$ is a matrix encoding game payoffs.
  • Saddle Point Residual (or Gap):

    • Concept: A measure of how close a given strategy profile $(\bar{x}, \bar{y})$ is to being a saddle point (Nash equilibrium). A lower residual indicates a closer approximation.
    • Formula: For a point $(\bar{x}, \bar{y}) \in \mathcal{X} \times \mathcal{Y}$, the residual $\xi$ is defined as $$\xi := \operatorname*{max}_{\hat{y} \in \mathcal{Y}} \bar{x}^\top A \hat{y} - \operatorname*{min}_{\hat{x} \in \mathcal{X}} \hat{x}^\top A \bar{y}.$$ At a true saddle point, the residual is zero. The average strategies produced by regret minimization algorithms are typically evaluated by their saddle point residual.
  • Optimistic Follow-the-Regularized-Leader (OFTRL):

    • Concept: A specific optimistic regret minimization algorithm that is a variant of the Follow-the-Regularized-Leader (FTRL) framework. FTRL makes decisions by minimizing the sum of past losses plus a regularizer. OFTRL modifies this by incorporating a prediction of the next loss vector into the minimization problem. It's known for achieving faster convergence rates under certain conditions.
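The following minimal Python sketch illustrates the regret bookkeeping and regret matching update described in the list above on a single simplex. It is an illustrative toy (the loss vectors are random draws, and names are our own), not the paper's implementation.

```python
import numpy as np

def regret_matching_strategy(cumulative_regret: np.ndarray) -> np.ndarray:
    """Play proportionally to positive cumulative regrets (uniform if none are positive)."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(cumulative_regret.shape, 1.0 / len(cumulative_regret))

# Toy run: 3 actions, random loss vectors; track the cumulative regret R^T.
rng = np.random.default_rng(0)
n_actions, T = 3, 1000
cum_regret = np.zeros(n_actions)
total_loss = 0.0                       # sum_t <ell^t, x^t>
loss_per_action = np.zeros(n_actions)  # sum_t ell^t (loss of each fixed action)
for t in range(T):
    x = regret_matching_strategy(cum_regret)
    loss = rng.uniform(0.0, 1.0, size=n_actions)   # loss vector ell^t, revealed after acting
    total_loss += loss @ x
    loss_per_action += loss
    cum_regret += loss @ x - loss                  # instantaneous regret of each fixed action
print("cumulative regret R^T =", total_loss - loss_per_action.min())
```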

3.2. Previous Works

The paper builds upon a rich history of research in game theory, online learning, and AI for games.

  • Traditional CFR and its Variants:

    • CFR (Zinkevich et al., 2007): The foundational algorithm. Known for its practical effectiveness but a theoretical worst-case convergence rate of $O(T^{-1/2})$.
    • Monte-Carlo CFR (Lanctot et al., 2009): Extends CFR to handle larger games by sampling portions of the game tree, making it more scalable. Still $O(T^{-1/2})$.
    • CFR+ (Tammelin et al., 2015): A significant improvement to CFR that achieves much faster empirical convergence by applying positive-only regret updates and resetting negative cumulative regrets to zero. Despite its practical success, its worst-case theoretical rate remains $O(T^{-1/2})$.
    • Discounted CFR (DCFR) (Brown & Sandholm, 2019): A more recent and often even faster variant than CFR+ in practice. It discounts past regrets, giving more weight to recent iterations. Like CFR+, its worst-case theoretical rate is $O(T^{-1/2})$.
    • Commonality: All these variants have been instrumental in AI poker breakthroughs (Bowling et al., 2015; Moravčík et al., 2017; Brown & Sandholm, 2017b). Their $O(T^{-1/2})$ theoretical rate is a recurring theme, which this paper aims to break.
  • First-Order Methods (FOMs) for EFGs:

    • Concept: These methods use gradient-based optimization techniques to solve bilinear saddle-point problems arising from extensive-form games in sequence form.
    • Examples: Techniques like the excessive gap technique (Nesterov, 2005) or mirror prox (Nemirovski, 2004), combined with dilated distance-generating functions (Hoda et al., 2010; Kroer et al., 2015; 2018b), are known to achieve an $O(T^{-1})$ convergence rate.
    • Challenge: Despite their theoretical superiority in terms of asymptotic convergence, these first-order methods have often been found to perform worse than newer CFR algorithms (like CFR+ and DCFR) in practice (Kroer et al., 2018b;a). This practical gap motivates the current paper to improve CFR's theoretical rate while retaining its practical advantages.
  • Optimistic Learning / Predictive Regret Minimization:

    • Origins: The idea of optimistic learning (or predictive regret minimization) emerged in online convex optimization.
    • Key Works:
      • Chiang et al. (2012) introduced the concept of online optimization with gradual variations.
      • Rakhlin & Sridharan (2013a;b) showed that by leveraging cancellations in optimistic mirror descent (OMD), an $O(T^{-1})$ convergence rate can be achieved in matrix games when predictions are accurate (specifically, $m^t = \ell^{t-1}$).
      • Syrgkanis et al. (2015) introduced the Optimistic Follow-the-Regularized-Leader (OFTRL) algorithm. They demonstrated that even when players use different algorithms, an $O(T^{-3/4})$ rate can be achieved for matrix games if each algorithm satisfies a stability criterion and leverages predictability of the loss inputs (their Regret bounded by Variation in Utilities (RVU) property). This paper explicitly builds on this latter generalization.
    • Differentiation: The crucial aspect for this paper is that previous optimistic methods primarily applied to simpler matrix games. Extending these ideas to the complex, tree-structured extensive-form games via the CFR framework is non-trivial and forms the core of this work.
  • Prior Attempts to Integrate Optimistic Ideas into CFR:

    • Burch (2017) explored CFR-like features within $O(T^{-1})$ first-order methods and regret minimizers.
    • Brown & Sandholm (2019) experimentally tried optimistic variants of regret minimizers in CFR.
    • Key Distinction: The current paper stresses that these prior attempts were only experimental. This paper provides the first rigorous theoretical framework to incorporate optimistic regret minimization into CFR and achieve a proven theoretical speedup.

3.3. Technological Evolution

The evolution of algorithms for solving extensive-form games has generally followed two main paths:

  1. Regret Minimization (CFR family): Starting with CFR (2007), improvements focused on practical efficiency and scalability (e.g., Monte-Carlo CFR, CFR+, Discounted CFR). These methods leverage the hierarchical structure of EFGs well but have been theoretically limited to $O(T^{-1/2})$.

  2. First-Order Methods (FOMs): These methods treat the game as a bilinear saddle-point problem and apply general convex optimization techniques (e.g., mirror prox, excessive gap technique). They offer a theoretically superior $O(T^{-1})$ rate but have struggled with practical performance in large EFGs.

    This paper's work (Stable-Predictive Optimistic CFR) represents a significant step in the first path, aiming to theoretically accelerate CFR itself, rather than trying to make FOMs more CFR-like. It incorporates advanced online learning concepts, specifically optimistic regret minimization, which had previously shown success in matrix games for accelerating convergence. By bringing these predictive and stable properties into CFR's decomposition, it seeks to bridge the theoretical rate gap while maintaining CFR's practical strengths.

3.4. Differentiation Analysis

The core differences and innovations of this paper's approach compared to prior methods are:

  • Breaking the $O(T^{-1/2})$ Barrier within CFR: Unlike all previous CFR variants (vanilla CFR, Monte-Carlo CFR, CFR+, DCFR), this is the first to rigorously prove a better theoretical worst-case convergence rate ($O(T^{-3/4})$). This is its most significant theoretical differentiation.
  • Systematic Integration of Optimistic Learning into CFR: Previous attempts to incorporate optimistic ideas into CFR were either experimental or within FOMs. This paper provides a formal framework for embedding stable-predictive regret minimizers at each information set within the CFR recursive decomposition, proving that the overall algorithm inherits the accelerated convergence.
  • Novel Concept of Stable-Predictivity: The paper introduces a specific definition of stable-predictive regret minimizer that is tailored for the CFR context. This definition ties the regret bound to prediction quality and explicitly incorporates a stability requirement on iterates, with coefficients that are inversely proportional and linked to a chosen stability parameter $\kappa$. This differs from previous RVU properties (e.g., Syrgkanis et al., 2015) by not assuming $m^t = \ell^{t-1}$ and by explicitly requiring the stability property (5) instead of relying on a cancellation term.
  • Adaptive Stability Parameter Selection: A key innovation is the recursive scheme for setting the stability parameter $\kappa_j$ of each local regret minimizer based on its depth and position in the game tree (Equations 14-16). This hierarchical tuning of stability is crucial for the inductive proof of the overall algorithm's stable-predictivity.
  • Focus on CFR's Strengths: Unlike FOMs, which achieve $O(T^{-1})$ but often underperform in practice, this work enhances the theoretical footing of CFR, an algorithm known for its practical efficiency and memory management in large games. It aims to get the best of both worlds: faster theory with retained practical benefits.

4. Methodology

4.1. Principles

The core idea behind Stable-Predictive Optimistic Counterfactual Regret Minimization (SP-OCFR) is to accelerate the convergence of the CFR framework by replacing the simple regret minimizers used at each information set with more sophisticated stable-predictive optimistic regret minimizers. The key principles are:

  1. Decomposition of Regret: CFR inherently decomposes the global problem of minimizing regret in an extensive-form game (EFG) into local regret minimization problems at each information set (or decision node). The paper maintains this decomposition.
  2. Optimistic Learning at Local Level: At each local decision node, instead of using a standard (non-predictive) regret minimizer, the algorithm employs an optimistic regret minimizer that can leverage predictions of future losses. This is designed to reduce the cumulative regret at each local sub-problem.
  3. Stable-Predictivity: The chosen local regret minimizers must satisfy a novel stable-predictivity property. This property ensures that:
    • Prediction Benefit: Small errors in loss predictions lead to only minimal increases in cumulative regret.
    • Stability: The decisions (strategies) produced by the minimizer change slowly over iterations. This "smoothness" in strategy updates is crucial for the overall convergence guarantee when combined across the game tree.
  4. Hierarchical Stability Tuning: The stability parameter of each local optimistic regret minimizer is not fixed but is dynamically set relative to its position (depth) within the game tree. Regret minimizers deeper in the tree are configured to be more stable (i.e., their decisions change less drastically between iterations) to ensure the overall stability of the full strategy.
  5. Inductive Proof: The overall $O(T^{-3/4})$ convergence rate is proven inductively, starting from the leaves of the game tree and working upwards to the root, showing that if the children of a node have stable-predictive properties, the parent node also does, with appropriate scaling of parameters.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology can be understood by first detailing the underlying game representation, then introducing the stable-predictive regret minimizer concept, and finally showing how CFR's recursive structure is adapted to use these minimizers.

4.2.1. Sequential Decision Making and EFG Strategy Spaces (Section 2)

The paper begins by formalizing the structure of a sequential decision process, which is fundamental to extensive-form games.

  • Tree Structure: A sequential decision process is modeled as a tree with two types of nodes:
    • Decision nodes ($\mathcal{I}$): Where an agent chooses an action. At each $j \in \mathcal{I}$, the agent selects a strategy $\hat{x}_j \in \Delta^{n_j}$, a probability distribution over the $n_j = |A_j|$ actions available at $j$.
    • Observation nodes ($\mathcal{K}$): Where the agent might receive a signal or observation. At $k \in \mathcal{K}$, the agent might receive one of $n_k$ signals from the set $S_k$.
  • Node Relationships:
    • $\rho(j, a)$: The observation node reached after taking action $a$ at decision node $j$.
    • $\rho(k, s)$: The decision node reached after observing signal $s$ at observation node $k$.
    • $\mathcal{C}_j := \{\rho(j, a) : a \in A_j\}$: Set of observation points reachable from $j$.
    • $\mathcal{C}_k := \{\rho(k, s) : s \in S_k\}$: Set of decision points reachable from $k$.
    • $\mathcal{C}_{ja}$ is shorthand for $\mathcal{C}_{\rho(j, a)}$.
  • Expected Loss: The agent incurs an (expected) linear loss $\langle \ell_j, \hat{x}_j \rangle$ at each decision point $j$. The total expected loss is $\sum_{j \in \mathcal{I}} \pi_j \langle \ell_j, \hat{x}_j \rangle$, where $\pi_j$ is the probability of reaching $j$. This $\pi_j$ makes the overall loss non-linear in the individual strategies $\hat{x}_j$.

Example: Kuhn Poker (Figure 1) The paper provides an illustration using Kuhn poker (Kuhn, 1950), a simplified three-card poker game. The sequential decision process for Player 1 is depicted as:

Figure 1. The sequential decision process for the first player in the game of Kuhn poker. $\circledast$ denotes an observation point; small dots represent the end of the decision process.

  • Decision nodes: $\mathcal{I} = \{X_0, X_1, X_2, X_3, X_4, X_5, X_6\}$.
  • $n_0 = 1$, with $A_{X_0} = \{\mathrm{start}\}$.
  • $n_j = 2$ for all $j \in \mathcal{I} \setminus \{X_0\}$, with $A_{X_1} = A_{X_2} = A_{X_3} = \{\mathrm{check}, \mathrm{raise}\}$ and $A_{X_4} = A_{X_5} = A_{X_6} = \{\mathrm{fold}, \mathrm{call}\}$.
  • Observation points ($\circledast$): e.g., after Player 1 checks, Player 2 makes a decision, which Player 1 then observes.

**Sequence Form for Sequential Decision Processes:**
To address the non-linearity of the expected loss in behavioral strategies, the paper utilizes the sequence form.
*   **Concept:** Instead of local probabilities, sequence-form strategies $x^\triangle$ represent the probability of playing a *sequence* of actions from the root. This turns the decision space into a convex and compact set $X^\triangle$, over which the expected loss is a linear function.
*   **Recursive Definition:** The sequence-form representation $X^\triangle$ is built recursively:
    *   At every observation point $k \in \mathcal{K}$: the strategy space $X_k^\triangle$ is the Cartesian product of the strategy spaces of its child decision points,
        $$X_k^\triangle := X_{j_1}^\triangle \times X_{j_2}^\triangle \times \cdots \times X_{j_{n_k}}^\triangle,$$
        where $\{j_1, j_2, \ldots, j_{n_k}\} = \mathcal{C}_k$ are the child decision points of $k$. A strategy at an observation node thus consists of independently chosen strategies for each subsequent decision node.
    *   At every decision point $j \in \mathcal{I}$: the strategy space $X_j^\triangle$ represents the choice of actions at $j$ and the strategies of the subsequent subgames. If action $a_i \in A_j$ is chosen with probability $\lambda_i$, the strategies in the subgame $X_{k_i}^\triangle$ (where $k_i = \rho(j, a_i)$) are scaled by $\lambda_i$:
        $$X_j^\triangle := \left\{ \begin{pmatrix} \lambda_1 x_{k_1} \\ \vdots \\ \lambda_{n_j} x_{k_{n_j}} \end{pmatrix} : (\lambda_1, \ldots, \lambda_{n_j}) \in \Delta^{n_j},\ x_{k_1} \in X_{k_1}^\triangle, \ldots, x_{k_{n_j}} \in X_{k_{n_j}}^\triangle \right\},$$
        where $\{k_1, k_2, \ldots, k_{n_j}\} = \mathcal{C}_j$ are the child observation points of $j$. As written, this display is slightly misleading: the sequence form for $X_j^\triangle$ represents paths *originating* from $j$. A more precise reading, consistent with typical sequence-form definitions, is that $X_j^\triangle$ contains sequences that extend from $j$: the vector of local decisions $\hat{x}_j = (\lambda_1, \ldots, \lambda_{n_j})$ is the probability distribution over actions at $j$, and the sequence-form representation $x_j^\triangle$ at node $j$ has entries $\lambda_a x_{k_a}^\triangle$ for each action $a \in A_j$, where $x_{k_a}^\triangle$ is the sequence-form strategy for the subtree rooted at $k_a$. (A minimal code sketch of this composition follows below.)
*   **Root Strategy Space:** The overall sequence-form strategy space for the process is $X_r^\triangle$, where $r$ is the root.
*   **Bilinear Saddle-Point Problem (BSPP):** With the sequence form, finding a Nash equilibrium in a two-player zero-sum EFG becomes a BSPP of the form $\operatorname*{min}_{x \in \mathcal{X}} \operatorname*{max}_{y \in \mathcal{Y}} x^\top A y$, where $\mathcal{X} = X^\triangle$ and $\mathcal{Y} = Y^\triangle$ are the sequence-form strategy spaces of the two players, and $A$ encodes the leaf payoffs.
*   **Notation:**
    *   $[v]_{\downarrow j}$: Subset of entries of a vector $v$ referring to decision points and actions at or below node $j$.
    *   $[v]_j$: Subset of the $n_j = |A_j|$ entries of $v$ pertaining to the actions $a \in A_j$ at decision point $j$.
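The composition of a sequence-form strategy from behavioral (local) strategies can be sketched in a few lines of Python. This is an illustrative toy under the interpretation described above (local probabilities followed by probability-scaled child vectors); the node layout and field names are assumptions, not from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                                             # "decision" or "observation"
    children: List["Node"] = field(default_factory=list)
    behavior: List[float] = field(default_factory=list)   # local action probabilities (decision nodes)

def sequence_form(node: Node) -> List[float]:
    """Build the sequence-form vector: at a decision node, the local probabilities followed by
    the probability-scaled child vectors; at an observation node, the concatenation of children."""
    if node.kind == "observation":
        out: List[float] = []
        for child in node.children:
            out.extend(sequence_form(child))
        return out
    out = list(node.behavior)                              # entries for the actions at this node
    for prob, child in zip(node.behavior, node.children):
        out.extend(prob * v for v in sequence_form(child))
    return out

# Toy tree: a root decision node with two actions; the first action leads to an
# observation node whose only child is another decision node.
leaf = Node("decision", behavior=[0.25, 0.75])
root = Node("decision", behavior=[0.4, 0.6],
            children=[Node("observation", children=[leaf]), Node("observation")])
print(sequence_form(root))   # [0.4, 0.6, 0.1, 0.3]
```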

### 4.2.2. Stable-Predictive Regret Minimizers (Section 3)

The paper introduces a new class of `predictive regret minimizers`.

*   **Online Learning Setup:** A decision maker chooses $x^t \in \mathcal{X}$ (a convex, compact set). An environment reveals a convex loss $\ell^t$. Before deciding, the decision maker receives a prediction $m^t$ of $\ell^t$.
*   **Definition 1 (Stable-predictive regret minimizer):** A predictive regret minimizer is $(\kappa, \alpha, \beta)$-stable-predictive if:
    1.  **Stability:** The decisions produced change slowly:
        $$\| x^{t+1} - x^t \| \leq \kappa \quad \forall t \geq 1.$$
        This means the change in strategy between successive iterations is bounded by $\kappa$.
    2.  **Prediction bound:** The cumulative regret up to time $T$ is bounded according to
        $$R^T \leq \frac{\alpha}{\kappa} + \beta\kappa \sum_{t=1}^T \| \ell^t - m^t \|_*^2.$$
    Here, $\alpha$ and $\beta$ are constants. The term $\frac{\alpha}{\kappa}$ represents a baseline regret, while $\beta\kappa \sum_{t=1}^T \| \ell^t - m^t \|_*^2$ quantifies the penalty for prediction errors. If the predictions $m^t$ perfectly match $\ell^t$, the sum vanishes and the regret remains asymptotically constant.
*   **Comparison to the RVU property (Syrgkanis et al., 2015):** The paper highlights the differences:
    *   Syrgkanis et al. (2015) assume $m^t = \ell^{t-1}$, leading to a term $\sum_t \| \ell^t - \ell^{t-1} \|_*^2$. This paper makes no such assumption on $m^t$.
    *   The RVU property includes a cancellation term $-\gamma' \sum_t \| x^t - x^{t-1} \|^2$. This paper replaces it with the explicit stability condition (Equation 5).
    *   The coefficients $\frac{\alpha}{\kappa}$ and $\beta\kappa$ in Equation 6 are forced to be inversely proportional and tied to $\kappa$, which Syrgkanis et al. (2015) showed for OFTRL but did not require in their general RVU definition.

**Relationship with Bilinear Saddle-Point Problems (Section 3.1):**
*   **Saddle Point Residual ($\xi$):** Measures how close the average strategies $(\bar{x}, \bar{y})$ are to a saddle point:
    $$\xi := \operatorname*{max}_{\hat{y} \in \mathcal{Y}} \bar{x}^\top A \hat{y} - \operatorname*{min}_{\hat{x} \in \mathcal{X}} \hat{x}^\top A \bar{y}.$$
*   **Regret-to-Residual Bound:** For two regret minimizers (one for player $\mathcal{X}$, one for player $\mathcal{Y}$) observing loss vectors $\ell_{\mathcal{X}}^t := -A y^t$ and $\ell_{\mathcal{Y}}^t := A^\top x^t$, the average decisions $(\frac{1}{T} \sum_t x^t, \frac{1}{T} \sum_t y^t)$ have residual bounded by
    $$\xi \leq \frac{1}{T} \left( R_{\mathcal{X}}^T + R_{\mathcal{Y}}^T \right).$$
*   **Derivation of the $O(T^{-3/4})$ Rate:** If both regret minimizers are $(\kappa, \alpha, \beta)$-stable-predictive and use the predictions $m_\mathcal{X}^t := \ell_\mathcal{X}^{t-1}$ and $m_\mathcal{Y}^t := \ell_\mathcal{Y}^{t-1}$, then the residual of the average decisions satisfies
    $$\begin{aligned}
    T \xi &\leq \frac{2\alpha}{\kappa} + \beta\kappa \sum_{t=1}^T \| -A y^t + A y^{t-1} \|_*^2 + \beta\kappa \sum_{t=1}^T \| A^\top x^t - A^\top x^{t-1} \|_*^2 \\
    &\leq \frac{2\alpha}{\kappa} + \beta \| A \|_{\mathrm{op}}^2 \kappa \left( \sum_{t=1}^T \| x^t - x^{t-1} \|^2 + \sum_{t=1}^T \| y^t - y^{t-1} \|^2 \right) \\
    &\leq \frac{2\alpha}{\kappa} + 2\beta T \| A \|_{\mathrm{op}}^2 \kappa^3.
    \end{aligned}$$
    The second inequality uses $\|Av\|_* \leq \|A\|_{\mathrm{op}} \|v\|$ for the operator norm $\|A\|_{\mathrm{op}}$. The third inequality uses the stability condition $\|x^t - x^{t-1}\| \leq \kappa$ (Equation 5), and similarly for $y^t$.
    If $\kappa$ is chosen as $\Theta(T^{-1/4})$, then $T\xi \leq O(T^{1/4}) + O(T \cdot (T^{-1/4})^3) = O(T^{1/4})$. Dividing by $T$ gives $\xi = O(T^{-3/4})$, a speedup over the $O(T^{-1/2})$ rate obtained with regular (non-predictive) regret minimizers.
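To see why $\kappa = \Theta(T^{-1/4})$ is the right scale, one can minimize the final bound over $\kappa$; this short calculation is implicit in the analysis above rather than spelled out in the paper:
$$\frac{d}{d\kappa}\left[\frac{2\alpha}{\kappa} + 2\beta T \|A\|_{\mathrm{op}}^2 \kappa^3\right] = -\frac{2\alpha}{\kappa^2} + 6\beta T \|A\|_{\mathrm{op}}^2 \kappa^2 = 0 \quad\Longleftrightarrow\quad \kappa = \left(\frac{\alpha}{3\beta \|A\|_{\mathrm{op}}^2\, T}\right)^{1/4} = \Theta(T^{-1/4}).$$
Substituting this choice back makes both terms scale as $T^{1/4}$, matching the $O(T^{1/4})$ bound on $T\xi$ quoted above.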

**Optimistic Follow the Regularized Leader (OFTRL) (Section 3.2):**
The paper uses OFTRL as a concrete instance of a stable-predictive regret minimizer.
*   **Algorithm:** At each time $t$, OFTRL outputs the decision
    $$x^t = \underset{\tilde{x} \in \mathcal{X}}{\mathrm{argmin}} \left\{ \left\langle \tilde{x},\ m^t + \sum_{\tau=1}^{t-1} \ell^\tau \right\rangle + \frac{1}{\eta} R(\tilde{x}) \right\},$$
    where:
    *   $\eta > 0$ is a step size constant.
    *   $R(\cdot)$ is a 1-strongly convex regularizer with respect to the chosen norm $\|\cdot\|$. A common choice is $R(x) = \frac{1}{2}\|x\|_2^2$ (the entropy regularizer is also common on simplex domains).
    *   $m^t$ is the prediction for the loss $\ell^t$.
*   **Parameters:**
    *   $\Delta_R := \operatorname*{max}_{x, y \in \mathcal{X}} \{ R(x) - R(y) \}$ is the diameter of the range of $R$.
    *   $\Delta_\ell := \operatorname*{max}_t \operatorname*{max} \{ \| \ell^t \|_*, \| m^t \|_* \}$ is the maximum (dual) norm of any loss vector or prediction.
*   **Theorem 1:** OFTRL is a stable-predictive regret minimizer with stability parameter $\kappa = 3\eta\Delta_\ell$ and constants $\alpha$, $\beta$ determined by $\Delta_R$ and $\Delta_\ell$.
    *   **Proof (Appendix A):**
        *   **argmin-function continuity (Lemma 5):** The function $\tilde{x}(L) := \operatorname*{argmin}_{x \in \mathcal{X}} \{ \langle x, L \rangle + \frac{1}{\eta} R(x) \}$ is $\eta$-Lipschitz continuous with respect to the dual norm:
            $$\| \tilde{x}(L) - \tilde{x}(L') \| \leq \eta \| L - L' \|_*.$$
            This is proven using the variational inequality for optimality and the strong convexity of $R(\cdot)$.
        *   **Stability of OFTRL (Corollary 2):** Applying Lemma 5 (with $L^t$ denoting the cumulative loss up to time $t$), the iterates of OFTRL satisfy
            $$\| x^t - x^{t-1} \| = \left\| \tilde{x}( L^{t-1} + m^t ) - \tilde{x}( L^{t-2} + m^{t-1} ) \right\| \leq \eta \| \ell^{t-1} + m^t - m^{t-1} \|_* \leq 3\eta \Delta_\ell.$$
            This shows that OFTRL satisfies the stability condition with $\kappa = 3\eta \Delta_\ell$.
        *   The prediction-bound part of stable-predictivity (Equation 6) for OFTRL is stated to follow directly from the proof of Theorem 19 in Syrgkanis et al. (2015).
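A minimal Python sketch of OFTRL on a probability simplex with the Euclidean regularizer $R(x) = \frac{1}{2}\|x\|_2^2$, run in self-play on a random matrix game with predictions $m^t = \ell^{t-1}$. This is an illustrative toy, not the paper's implementation; the sign convention below treats player $x$ as the minimizer of $x^\top A y$.

```python
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto the probability simplex (standard sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def oftrl_step(cumulative_loss: np.ndarray, prediction: np.ndarray, eta: float) -> np.ndarray:
    """One OFTRL decision with R(x) = 0.5 * ||x||_2^2 on the simplex:
    argmin_x <x, m^t + L^{t-1}> + (1/eta) R(x)  ==  projection of  -eta * (m^t + L^{t-1})."""
    return project_to_simplex(-eta * (cumulative_loss + prediction))

# Self-play on a random 3x3 matrix game, with each player predicting m^t = ell^{t-1}.
rng = np.random.default_rng(1)
A = rng.uniform(-1.0, 1.0, size=(3, 3))
T, eta = 500, 0.1
Lx, Ly = np.zeros(3), np.zeros(3)      # cumulative losses L^{t-1}
mx, my = np.zeros(3), np.zeros(3)      # predictions (previous losses)
x_avg, y_avg = np.zeros(3), np.zeros(3)
for t in range(T):
    x = oftrl_step(Lx, mx, eta)
    y = oftrl_step(Ly, my, eta)
    lx, ly = A @ y, -(A.T @ x)         # x minimizes x^T A y, y maximizes it
    Lx, Ly = Lx + lx, Ly + ly
    mx, my = lx, ly
    x_avg, y_avg = x_avg + x / T, y_avg + y / T
gap = np.max(A.T @ x_avg) - np.min(A @ y_avg)   # saddle point residual of the average strategies
print("saddle point gap:", gap)
```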

### 4.2.3. CFR as Regret Decomposition (Section 4)

The paper explains how CFR naturally decomposes the overall strategy space $X^\triangle$ into local decisions.
*   **Counterfactual Loss Functions:** For each decision node $j \in \mathcal{I}$, CFR defines a local counterfactual loss function $\hat{\ell}_j^{t, \circ} : \Delta^{n_j} \to \mathbb{R}$. This function measures the expected loss if the agent only changes its strategy at $j$ to $x_j$, while maintaining the global strategy $x^{\triangle, t}$ everywhere else:
    $$\hat{\ell}_j^{t, \circ} : x_j = (x_{ja_1}, \ldots, x_{ja_{n_j}}) \mapsto \langle [\ell^{\triangle, t}]_j, x_j \rangle + \sum_{a \in A_j} \left( x_{ja} \sum_{j' \in \mathcal{C}_{ja}} \langle [\ell^{\triangle, t}]_{\downarrow j'}, [x^{\triangle, t}]_{\downarrow j'} \rangle \right).$$
    This linear function is then represented as a vector $\hat{\ell}_j^t$ such that $\hat{\ell}_j^{t, \circ}(x_j) = \langle \hat{\ell}_j^t, x_j \rangle$.
*   **Local Regret:** For each $j \in \mathcal{I}$, a sequence of local decisions $\hat{x}_j^1, \ldots, \hat{x}_j^T$ has a local cumulative regret $\hat{R}_j^T$:
    $$\hat{R}_j^T := \sum_{t=1}^T \langle \hat{\ell}_j^t, \hat{x}_j^t \rangle - \operatorname*{min}_{\tilde{x}_j \in \Delta^{n_j}} \sum_{t=1}^T \langle \hat{\ell}_j^t, \tilde{x}_j \rangle.$$

    This is the regret accumulated by the local regret minimizer $\mathcal{\hat{R}}_j$ operating only on the simplex $\Delta^{n_j}$.
*   **Inductive View of CFR:** CFR is viewed as recursively building regret minimizers for subtrees.
    *   $\mathcal{R}_v^\triangle$: The regret minimizer for the subtree rooted at node $v \in \mathcal{I} \cup \mathcal{K}$.
    *   $R_v^{\triangle, T}$: The regret of $\mathcal{R}_v^\triangle$.
    *   $\ell_v^{\triangle, t}$: The loss function entering $\mathcal{R}_v^\triangle$.
*   **Loss and Strategy Forwarding:**
    *   Loss decomposition:
        $$\ell_k^{\triangle, t} = [\ell_j^{\triangle, t}]_{\downarrow k} \quad \forall k \in \mathcal{C}_j, \qquad \mathrm{and} \qquad \ell_j^{\triangle, t} = [\ell_k^{\triangle, t}]_{\downarrow j} \quad \forall j \in \mathcal{C}_k.$$
        This means loss vectors are partitioned and forwarded down the tree.
    *   Strategy composition:
        $$\forall k \in \mathcal{K}, \quad x_k^{\triangle, t} = \left( x_{j_1}^{\triangle, t}, \dots, x_{j_{n_k}}^{\triangle, t} \right), \quad \mathrm{where} \ \{j_1, \dots, j_{n_k}\} = \mathcal{C}_k,$$
        $$\forall j \in \mathcal{I}, \quad x_j^{\triangle, t} = \left( \hat{x}_j^t,\ \hat{x}_{ja_1}^t x_{\rho(j, a_1)}^{\triangle, t}, \dots, \hat{x}_{ja_{n_j}}^t x_{\rho(j, a_{n_j})}^{\triangle, t} \right), \quad \mathrm{where} \ \{a_1, \dots, a_{n_j}\} = A_j.$$
        The second equation expresses how the overall strategy $x_j^{\triangle, t}$ at decision node $j$ is composed of the local decision $\hat{x}_j^t$ (a probability distribution over $A_j$) and the scaled strategies from its child observation nodes.

*   **Regret Decomposition Lemmas:** These lemmas show how the regret of a parent node relates to the regrets of its children. They are crucial for the inductive proof.
    *   **Lemma 1 (Observation Node):** Let $k \in \mathcal{K}$ be an observation node. Then the regret of the minimizer for $k$ is the sum of the regrets of its child decision nodes:
        $$R_k^{\triangle, T} = \sum_{j \in \mathcal{C}_k} R_j^{\triangle, T}.$$
        **Proof Sketch:** The strategy space $X_k^\triangle$ is a Cartesian product, and the loss vector $\ell_k^{\triangle, T}$ is partitioned accordingly. This allows the regret minimization problem at $k$ to be broken down into independent minimizations, one for each child $j \in \mathcal{C}_k$.
    *   **Lemma 2 (Decision Node):** Let $j \in \mathcal{I}$ be a decision node. Then the regret of the minimizer for $j$ is bounded by the local counterfactual regret at $j$ plus the maximum regret of its child observation nodes:
        $$R_j^{\triangle, T} \leq \hat{R}_j^T + \operatorname*{max}_{k \in \mathcal{C}_j} R_k^{\triangle, T}.$$
        **Proof Sketch:** The crucial step is to separate the loss terms related to the local decision $\hat{x}_j^t$ from those related to the subtrees $x_k^{\triangle, t}$. The expected loss at $j$ is related to $\hat{\ell}_j^t$ and the regrets $R_k^{\triangle, T}$ of the children $k \in \mathcal{C}_j$. The maximum over children's regrets comes from bounding a term of the form $\sum_{a \in A_j} \tilde{x}_{ja} R_k^{\triangle, T}$, where the $\tilde{x}_{ja}$ are probabilities.
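The two lemmas suggest a simple recursion for propagating regret bounds through the decision process. A minimal Python sketch (node layout, field names, and numbers are illustrative assumptions, not from the paper):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str                               # "decision" or "observation"
    local_regret: float = 0.0               # local counterfactual regret \hat{R}_j^T (decision nodes)
    children: List["Node"] = field(default_factory=list)

def regret_bound(node: Node) -> float:
    """Bound on the subtree regret R_v^{triangle,T} implied by Lemmas 1 and 2."""
    if node.kind == "observation":
        return sum(regret_bound(child) for child in node.children)         # Lemma 1: sum over children
    worst_child = max((regret_bound(child) for child in node.children), default=0.0)
    return node.local_regret + worst_child                                 # Lemma 2: local + worst child

# Toy tree: a root decision node whose two observation children each lead to one decision node.
root = Node("decision", local_regret=1.0, children=[
    Node("observation", children=[Node("decision", local_regret=0.3)]),
    Node("observation", children=[Node("decision", local_regret=0.5)]),
])
print(regret_bound(root))   # 1.0 + max(0.3, 0.5) = 1.5
```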

### 4.2.4. Stable-Predictive Counterfactual Regret Minimization (Section 5)

This section describes the proposed algorithm by specifying how `stability parameters` and `predictions` are set for each local `stable-predictive regret minimizer`.

**5.1. Choice of Stability Parameters:**
*   **Hierarchical Stability $\gamma_v$:** A scalar $\gamma_v$ is associated with each node $v \in \mathcal{I} \cup \mathcal{K}$.
    *   For the root decision node $r$, $\gamma_r$ is set to the desired global stability parameter $\kappa^*$.
    *   For every other node $v$, $\gamma_v$ is set relative to its parent $u$:
        $$\gamma_v := \begin{cases} \dfrac{\gamma_u}{2 \sqrt{n_u}} & \mathrm{if}\ u \in \mathcal{I}, \\[2mm] \dfrac{\gamma_u}{\sqrt{n_u}} & \mathrm{if}\ u \in \mathcal{K}, \end{cases}$$
        where $n_u$ is the number of actions at decision node $u$ or the number of signals at observation node $u$. Under this recursion the target stability $\gamma_v$ shrinks with depth, so nodes deeper in the tree are required to be more stable (their decisions may change less between iterations); the effective stability parameter for the local regret minimizer $\hat{\mathcal{R}}_j$ is scaled accordingly.
*   **Local Stability $\kappa_j$:** The stability parameter for the local regret minimizer $\hat{\mathcal{R}}_j$ at decision point $j \in \mathcal{I}$ is chosen as
    $$\kappa_j := \frac{\gamma_j}{2 \sqrt{n_j}\, B_j},$$
    where $B_j$ is an upper bound on the 2-norm of any sequence-form strategy in $X_j^\triangle$. This bound is computed recursively:
    $$\forall k \in \mathcal{K}, \quad B_k = \sqrt{\sum_{j \in \mathcal{C}_k} B_j^2}, \qquad \forall j \in \mathcal{I}, \quad B_j = \sqrt{1 + \operatorname*{max}_{k \in \mathcal{C}_j} B_k^2}.$$
    (Note: the paper's Equation (13) appears to use a slightly different definition; the recursion above follows the Section 5.1 text, for consistency with the derivation.) Intuitively, $B_j$ measures the "size" of the strategy space rooted at $j$.
*   **OFTRL Step Size:** For OFTRL, the step size $\eta$ is set to $\eta = \kappa_j$ (assuming a specific bound on loss vector norms).
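A minimal Python sketch of the recursive parameter choices described above, on a toy tree; the class and field names are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Optional

@dataclass
class Node:
    kind: str                        # "decision" or "observation"
    n: int                           # number of actions (decision) or signals (observation)
    children: List["Node"] = field(default_factory=list)
    gamma: float = 0.0               # target stability of the subtree rooted at this node
    B: float = 0.0                   # bound on the 2-norm of sequence-form strategies below
    kappa: Optional[float] = None    # local stability for the regret minimizer at a decision node

def set_norm_bounds(v: Node) -> float:
    """Bottom-up recursion for B_v, following the Section 5.1 text."""
    if v.kind == "observation":
        v.B = sqrt(sum(set_norm_bounds(j) ** 2 for j in v.children))
    else:
        v.B = sqrt(1.0 + max((set_norm_bounds(k) ** 2 for k in v.children), default=0.0))
    return v.B

def set_stability(v: Node, gamma: float) -> None:
    """Top-down recursion for gamma_v and kappa_j (Equations 14-16 as described above)."""
    v.gamma = gamma
    if v.kind == "decision":
        v.kappa = v.gamma / (2.0 * sqrt(v.n) * v.B)   # local stability of the minimizer at v
        child_gamma = v.gamma / (2.0 * sqrt(v.n))
    else:
        child_gamma = v.gamma / sqrt(v.n)
    for child in v.children:
        set_stability(child, child_gamma)

# Toy tree, with the global stability parameter kappa* = T^(-1/4).
T = 10_000
root = Node("decision", n=2,
            children=[Node("observation", n=3, children=[Node("decision", n=2)])])
set_norm_bounds(root)
set_stability(root, gamma=T ** -0.25)
print(root.kappa, root.children[0].children[0].kappa)
```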

**5.2. Prediction of Counterfactual Loss Vectors:**
*   **Global Prediction:** Let $m^{\triangle, t}$ be the prediction for the global loss vector $\ell^{\triangle, t}$ received by $\mathcal{R}^\triangle$.
*   **Counterfactual Prediction Function:** In analogy with the counterfactual loss, a counterfactual prediction function $\hat{m}_j^{t, \circ}$ is defined for each decision point $j \in \mathcal{I}$:
    $$\hat{m}_j^{t, \circ} : \Delta^{n_j} \ni x_j = (x_{ja_1}, \ldots, x_{ja_{n_j}}) \mapsto \langle [m^{\triangle, t}]_j, x_j \rangle + \sum_{a \in A_j} \left( x_{ja} \sum_{j' \in \mathcal{C}_{ja}} \langle [m^{\triangle, t}]_{\downarrow j'}, [x^{\triangle, t}]_{\downarrow j'} \rangle \right).$$
    The prediction for the local counterfactual loss thus depends on the global prediction $m^{\triangle, t}$ and the current (time $t$) strategies $x^{\triangle, t}$ in the subtrees.
*   **Counterfactual Prediction Vector:** $\hat{m}_j^t$ is the unique vector in $\mathbb{R}^{n_j}$ such that $\hat{m}_j^{t, \circ}(x_j) = \langle \hat{m}_j^t, x_j \rangle$.

**5.3. Proof of Correctness:**
The paper proves by induction that the chosen stability parameters and predictions guarantee that $\mathcal{R}^\triangle$ (the overall regret minimizer) is a $(\kappa^*, O(1), O(1))$-stable-predictive regret minimizer. The induction proceeds bottom-up from the leaves of the sequential decision process. The proofs use the 2-norm for simplicity, since all norms are equivalent in finite-dimensional spaces.

*   **Lemma 3 (Observation Node Inductive Step):** If all child decision nodes $j \in \mathcal{C}_k$ are $(\gamma_j, O(1), O(1))$-stable-predictive, then the parent observation node $k$ is $(\gamma_k, O(1), O(1))$-stable-predictive.
    *   **Regret Bound:** By summing the regret bounds of the children (Lemma 1), substituting $\gamma_j = \gamma_k / \sqrt{n_k}$ (Equation 14), and recognizing that the local losses/predictions form a partition of the parent's loss/prediction, one obtains
        $$R_k^{\triangle, T} \leq \frac{O(1)}{\gamma_k} + O(1)\, \gamma_k \sum_{t=1}^T \| \ell_k^{\triangle, t} - m_k^{\triangle, t} \|_2^2.$$
    *   **Stability Bound:** The stability of the overall strategy $x_k^{\triangle, t}$ follows from the stability of its children $x_j^{\triangle, t}$:
        $$\| x_k^{\triangle, t} - x_k^{\triangle, t-1} \|_2 = \sqrt{ \sum_{j \in \mathcal{C}_k} \| x_j^{\triangle, t} - x_j^{\triangle, t-1} \|_2^2 } \leq \sqrt{ \sum_{j \in \mathcal{C}_k} \gamma_j^2 } = \gamma_k.$$
        This uses the strategy composition rule (Equation 15) and the definition of $\gamma_j$ from Equation 14.

*   **Lemma 4 (Decision Node Inductive Step):** If all child observation nodes $k \in \mathcal{C}_j$ are $(\gamma_k, O(1), O(1))$-stable-predictive, and the local regret minimizer $\hat{\mathcal{R}}_j$ is $(\kappa_j, O(1), O(1))$-stable-predictive, then the parent decision node $j$ is $(\gamma_j, O(1), O(1))$-stable-predictive.
    *   **Regret Bound:** Starting from Lemma 2, $R_j^{\triangle, T} \leq \hat{R}_j^T + \operatorname*{max}_{k \in \mathcal{C}_j} R_k^{\triangle, T}$. Substituting the stable-predictivity bounds for $\hat{R}_j^T$ and $R_k^{\triangle, T}$, using the relationships between $\kappa_j$, $\gamma_j$, and $\gamma_k$ (Equations 14, 15), and applying the norm bounds $B_k$ on the strategy space yields
        $$R_j^{\triangle, T} \leq \frac{O(1)}{\gamma_j} + O(1)\, \gamma_j \sum_{t=1}^T \| \ell_j^{\triangle, t} - m_j^{\triangle, t} \|_2^2.$$
        A key step involves bounding $\| \hat{\ell}_j^t - \hat{m}_j^t \|_2^2$ in terms of $\| \ell_j^{\triangle, t} - m_j^{\triangle, t} \|_2^2$ using the definitions of the counterfactual losses/predictions and the $B_k$ bounds.
    *   **Stability Bound:** The stability of $x_j^{\triangle, t}$ is derived from its composition (Equation 16), the stability of $\hat{\mathcal{R}}_j$ (i.e., $\| \hat{x}_j^t - \hat{x}_j^{t-1} \|_2 \leq \kappa_j$), and the inductive hypothesis for the children $k \in \mathcal{C}_j$ (i.e., $\| x_k^{\triangle, t} - x_k^{\triangle, t-1} \|_2 \leq \gamma_k$). After algebraic manipulation involving the Cauchy-Schwarz inequality and the definition of $B_j$, it follows that $\| x_j^{\triangle, t} - x_j^{\triangle, t-1} \|_2 \leq \gamma_j$.

*   **Corollary 1 (Main Theoretical Result):**
    If:
    1.  Each local regret minimizer $\hat{\mathcal{R}}_j$ is $(\kappa_j, O(1), O(1))$-stable-predictive (where $\kappa_j$ is as in Equation 15);
    2.  $\hat{\mathcal{R}}_j$ observes the counterfactual loss prediction $\hat{m}_j^t$ (Equation 17); and
    3.  $\hat{\mathcal{R}}_j$ observes the counterfactual loss $\hat{\ell}_j^t$ (Equation 10);
        then the overall regret minimizer $\mathcal{R}^\triangle$ for the entire sequential decision process is a $(\kappa^*, O(1), O(1))$-stable-predictive regret minimizer operating over the sequence-form strategy space.

**Final Convergence Rate:** Combining Corollary 1 with the bilinear saddle-point analysis of Section 3.1, the paper concludes that by constructing two $(\Theta(T^{-1/4}), O(1), O(1))$-stable-predictive regret minimizers (one per player), the algorithm finds an $O(T^{-3/4})$-approximate Nash equilibrium of a two-player zero-sum game after $T$ iterations.

# 5. Experimental Setup

## 5.1. Datasets
The experiments are conducted on `Heads-Up No-Limit Texas Hold'em (HUNL)` subgames, a benchmark domain in AI research for imperfect-information games. These specific subgames were used by the Libratus AI, which famously defeated top poker professionals (Brown & Sandholm, 2017b).

**Heads-Up No-Limit Texas Hold'em (HUNL) Game Description:**
*   **Players:** Two players, $P_1$ and $P_2$, each start with a stack of \$20,000.
*   **Positions:** Player positions (dealer/big blind) switch after each hand.
*   **Actions:** Players can `fold`, `call`, or `raise` on their turn.
    *   `Fold`: Player loses, the opponent takes the pot.
    *   `Call`: Player matches the opponent's most recent bet, placing an equal number of chips in the pot.
    *   `Raise`: Player adds more chips to the pot than the opponent's share.
*   **Betting Rounds:** Four betting rounds:
    1.  **Pre-flop:** Blinds are posted: under the paper's convention, $P_1$ places the big blind of \$100 and $P_2$ places the small blind of \$50 (the reverse of the more common labeling, likely a convention specific to their subgames). Both players receive two private cards. $P_2$ acts first.
    2.  **Flop:** Three community cards are dealt face up. $P_1$ acts first.
    3.  **Turn:** One more community card is dealt face up. $P_1$ acts first.
    4.  **River:** One final community card is dealt face up. $P_1$ acts first.
*   **Round End:** A round ends when both players have acted at least once and the most recent player has called.
*   **All-in:** Players cannot raise beyond their initial \$20,000 stack.
*   **Raise Rules:** All raises must be at least \$100 and at least as large as any previous raise in that round.
*   **Showdown:** If no player folds, the player with the best five-card poker hand (using their two private cards and five community cards) wins the pot. Ties split the pot evenly.

**Subgame Characteristics:**
The experiments are conducted on four open-source subgames derived from Libratus's real-time solving strategy.
*   **Bet Sizes:**
    *   **First bet of each round:** $0.5\times$ pot, $1\times$ pot, or an all-in bet.
    *   **Subsequent bets in a round:** $1\times$ pot or an all-in bet.
*   **Subgame 1:** Occurs over the third and fourth betting rounds. Pot size at start: \$500.
*   **Subgame 2:** Occurs over the third and fourth betting rounds. Pot size at start: \$4,780.
*   **Subgame 3:** Occurs over only the fourth betting round. Pot size at start: \$500.
*   **Subgame 4:** Occurs over only the fourth betting round. Pot size at start: \$3,750.

    These subgames are chosen because they represent realistic, complex scenarios from a high-stakes poker environment, making them effective for validating the performance of game-solving algorithms. The varying pot sizes and number of remaining betting rounds (`depth` of the game tree) allow for testing the algorithms under different scales of complexity.

## 5.2. Evaluation Metrics

The primary evaluation metric used is `exploitability`, measured in `milli big blinds per game (mbb/g)`.

*   **Exploitability:**
    1.  **Conceptual Definition:** In game theory, `exploitability` quantifies how much an opponent can gain by deviating from their optimal strategy against the current strategy of a player. In other words, it measures the maximum expected value an opponent can achieve against a given strategy. If a strategy has zero exploitability, it is a perfect Nash equilibrium. A lower exploitability value indicates a strategy that is closer to a Nash equilibrium and thus harder to beat.
    2.  **Mathematical Formula:** For a two-player zero-sum game in sequence form, if Player 1 uses strategy $x$ and Player 2 uses strategy $y$, the bilinear payoff is $x^\top A y$ (minimized by Player 1 and maximized by Player 2). The standard measure of exploitability for an approximate Nash equilibrium $(\bar{x}, \bar{y})$ is the saddle point residual $\xi$ defined in Section 3.1:
        $$\xi = \operatorname*{max}_{\hat{y} \in \mathcal{Y}} \bar{x}^\top A \hat{y} - \operatorname*{min}_{\hat{x} \in \mathcal{X}} \hat{x}^\top A \bar{y}.$$
        This is the gap between the value Player 2 can secure by best-responding to Player 1's average strategy $\bar{x}$ (the first term) and the value Player 1 can secure by best-responding to Player 2's average strategy $\bar{y}$ (the second term). At an exact Nash equilibrium the gap is zero; otherwise it quantifies how much payoff an optimal counter-player could extract.
    3.  **Symbol Explanation:**
        *   $A$: The game's payoff matrix in sequence form.
        *   $\mathcal{X}$: The sequence-form strategy space for Player 1.
        *   $\mathcal{Y}$: The sequence-form strategy space for Player 2.
        *   $\bar{x}$: The average strategy of Player 1 over $T$ iterations.
        *   $\bar{y}$: The average strategy of Player 2 over $T$ iterations.
        *   $\hat{x}$: A hypothetical (deviation) strategy for Player 1.
        *   $\hat{y}$: A hypothetical (deviation) strategy for Player 2.

*   **milli big blinds per game (mbb/g):**
    1.  **Conceptual Definition:** This is the standard measurement of win rate in poker literature. A `big blind (bb)` is a mandatory bet placed by one of the players at the start of a poker hand, defining the base stakes of the game. `mbb/g` measures the average amount of money (in thousandths of a big blind) a player would win or lose per hand against an optimal opponent. Lower (more negative) mbb/g for an opponent means they are being exploited more. For evaluating approximate Nash equilibria, `mbb/g` represents how far the strategy is from a perfect equilibrium in terms of average dollars lost to an optimal counter-strategy.
    2.  **Mathematical Formula:** Not explicitly provided in the paper, but commonly defined as:
        $$\mathrm{mbb/g} = \left( \frac{\text{Exploitability in chips}}{\text{Big blind size in chips}} \right) \times 1000.$$
3.  **Symbol Explanation:**
    *   `Exploitability in chips`: The value of the saddle point residual, converted to monetary units (chips).
    *   `Big Blind size in chips`: The monetary value of one big blind in the game. In the HUNL subgames, the initial big blind is \$100.
    *   `1000`: Conversion factor from big blinds to milli big blinds.
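For a plain matrix game (where both strategy spaces are simplices, so best responses are pure strategies), the residual and the mbb/g conversion take a few lines of Python; in sequence form the max/min would instead be a best-response traversal of the game tree. A minimal sketch with made-up numbers:

```python
import numpy as np

def saddle_point_residual(A: np.ndarray, x_bar: np.ndarray, y_bar: np.ndarray) -> float:
    """xi = max_y x_bar^T A y - min_x x^T A y_bar, for simplex strategy spaces."""
    best_response_p2 = np.max(A.T @ x_bar)   # value the maximizer gets against x_bar
    best_response_p1 = np.min(A @ y_bar)     # value the minimizer gets against y_bar
    return float(best_response_p2 - best_response_p1)

def to_mbb_per_game(exploitability_chips: float, big_blind_chips: float = 100.0) -> float:
    return exploitability_chips / big_blind_chips * 1000.0

# Made-up 2x2 game (payoffs in chips) and average strategies.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_bar = np.array([0.6, 0.4])
y_bar = np.array([0.5, 0.5])
xi = saddle_point_residual(A, x_bar, y_bar)
print(xi, "chips =", to_mbb_per_game(xi), "mbb/g")
```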

5.3. Baselines

The proposed method is compared against two standard algorithms for solving extensive-form games:

  • Vanilla CFR (labeled CFR in plots): This refers to the original Counterfactual Regret Minimization algorithm using regret matching as its strategy update rule. It is the foundational CFR algorithm, with a theoretical worst-case convergence rate of $O(T^{-1/2})$.

  • Discounted CFR (DCFR) (Brown & Sandholm, 2019) (labeled DCFR in plots): This is a current state-of-the-art algorithm in practice. DCFR is a variant of CFR that empirically converges significantly faster than vanilla CFR and CFR+ by discounting past regrets, giving more weight to recent iterations. Like vanilla CFR, its worst-case theoretical rate remains $O(T^{-1/2})$.

    The paper's method, Stable-Predictive Optimistic CFR with OFTRL (labeled OFTRL in plots), is compared against these. For OFTRL, two configurations are tested:

  • OFTRL theory: Uses the step size ($\eta$) parameter determined by the theoretical derivation (which is $\eta = \kappa_j$, where $\kappa_j$ depends on $\gamma_j$ and $B_j$).

  • OFTRL tuned: For larger subgames where the theoretically-correct step size was found to be too conservative, a manually tuned step size (by dividing the theoretical step size by 10, 100, or 1000 and picking the best) is used.
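
To make the optimistic update concrete, the sketch below shows a generic Optimistic FTRL regret minimizer over a single probability simplex, using an entropic regularizer and predicting that the next loss equals the last observed one. This is only an illustration of the OFTRL update rule under those assumptions; the paper's construction runs one such minimizer per decision point and derives the step size $\kappa_j$ from its stability requirements, which is not reproduced here. The class and method names are hypothetical.

```python
import numpy as np

class OFTRLSimplex:
    """Generic Optimistic FTRL over a probability simplex with an entropic
    regularizer; the prediction of the next loss is the last observed loss.
    A minimal sketch, not the paper's exact counterfactual construction.
    """
    def __init__(self, n_actions, eta):
        self.eta = eta                         # step size (the paper uses eta = kappa_j)
        self.cum_loss = np.zeros(n_actions)    # L^t: sum of losses observed so far
        self.last_loss = np.zeros(n_actions)   # m^{t+1}: prediction of the next loss

    def next_strategy(self):
        # With negative entropy as regularizer, the OFTRL iterate has the
        # closed form  x^{t+1}_i  proportional to  exp(-eta * (L^t_i + m^{t+1}_i)).
        z = -self.eta * (self.cum_loss + self.last_loss)
        z -= z.max()                           # shift for numerical stability
        p = np.exp(z)
        return p / p.sum()

    def observe_loss(self, loss):
        self.cum_loss += loss
        self.last_loss = loss.copy()
```

Intuitively, when consecutive losses change slowly, the prediction is accurate and the optimistic term cancels most of the incurred regret; this is the mechanism that the stability machinery is designed to exploit.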

    All algorithms are run with two types of updates:

  • Simultaneous updates: All players update their strategies at the same time based on the current state. This is the traditional way CFR is often implemented.

  • Alternating updates: Players update their strategies in sequence, with each player's update potentially influencing the next player's update in the same iteration. This is a common practical optimization that often leads to better empirical performance.
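
The two update schedules are easiest to contrast in a toy matrix-game loop. The sketch below is a simplification that reuses the hypothetical next_strategy / observe_loss interface from the OFTRL sketch above; it omits the per-information-set CFR decomposition and the accumulation of average strategies, both of which the real algorithms perform.

```python
def run_simultaneous(rm_x, rm_y, A, iterations):
    """Both players' losses in an iteration are computed from the strategies
    produced at the start of that same iteration. A is a numpy payoff matrix
    for the saddle point problem min_x max_y x^T A y."""
    for _ in range(iterations):
        x = rm_x.next_strategy()
        y = rm_y.next_strategy()
        rm_x.observe_loss(A @ y)        # loss vector for the minimizing player
        rm_y.observe_loss(-(A.T @ x))   # loss vector for the maximizing player

def run_alternating(rm_x, rm_y, A, iterations):
    """Player 1 updates first; Player 2's loss is then computed against
    Player 1's already-updated strategy within the same iteration."""
    for _ in range(iterations):
        y = rm_y.next_strategy()
        rm_x.observe_loss(A @ y)        # Player 1 updates against y
        x = rm_x.next_strategy()        # Player 1's refreshed strategy
        rm_y.observe_loss(-(A.T @ x))   # Player 2 then updates against it
```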

6. Results & Analysis

The experiments evaluate the convergence rate of the proposed Stable-Predictive Optimistic CFR (OFTRL) algorithm against vanilla CFR and Discounted CFR (DCFR) on four Heads-Up No-Limit Texas Hold'em (HUNL) subgames. The convergence is measured by exploitability in milli big blinds per game (mbb/g) against the number of iterations.

6.1. Core Results Analysis

The results are presented across four figures, comparing algorithm performance under simultaneous updates (Figures 2 & 3) and alternating updates (Figures 4 & 5).

6.1.1. Simultaneous Updates

This setting is the traditional way CFR is often used.

The results appear in Figure 2, which covers Subgames 2 and 4 (larger pot sizes), and Figure 3, which covers Subgames 1 and 3 (smaller pot sizes):

Figure 2. Convergence rate, with iterations on the x-axis and exploitability (saddle point gap, in mbb/g) on the y-axis, for Subgames 2 and 4. All algorithms use simultaneous updates.

Figure 3. Convergence rate, with iterations on the x-axis and exploitability (saddle point gap, in mbb/g) on the y-axis, for Subgames 1 and 3. All algorithms use simultaneous updates.

  • Subgames 3 and 4 (Smaller Games):

    • OFTRL theory vs. CFR: In Subgame 4, OFTRL theory (the proposed method with theoretically derived step sizes) outperforms CFR almost immediately and significantly. In Subgame 3, OFTRL theory also eventually outperforms CFR, but only after approximately 800 iterations. This suggests that in relatively smaller game trees, the theoretical advantages of OFTRL with its $O(T^{-3/4})$ rate begin to manifest, outperforming the $O(T^{-1/2})$ CFR.
    • DCFR vs. all: DCFR consistently performs best across both subgames, significantly outperforming both CFR and OFTRL theory. Its empirical performance is much better than its worst-case theoretical rate of $O(T^{-1/2})$ would suggest.
  • Subgames 1 and 2 (Larger Games):

    • OFTRL theory: In these larger subgames, the theoretically derived step size for OFTRL is found to be "much too conservative". The algorithm makes very little progress within the given number of iterations, with the exploitability curve remaining nearly flat. This suggests that, while theoretically sound, the derived step sizes can be overly cautious in larger games, hindering practical convergence.

    • OFTRL tuned: With a moderately hand-tuned step size (obtained by dividing the theoretical step size by 10, 100, or 1000 and keeping the best), OFTRL tuned performs noticeably better than CFR. This indicates that the optimistic regret minimization framework does have practical benefits over vanilla CFR, but the theoretical parameter settings are not always optimal for real-world performance, necessitating empirical tuning.

    • DCFR vs. all: Similar to the smaller subgames, DCFR remains the fastest algorithm in both Subgame 1 and Subgame 2, demonstrating its robust practical efficiency.

      Summary for Simultaneous Updates: The proposed OFTRL method shows theoretical promise and can outperform vanilla CFR, especially in smaller games or with tuned parameters in larger games. However, DCFR, despite its worse theoretical worst-case bound, remains the empirical state-of-the-art, highlighting the gap between theoretical guarantees and practical performance.

6.1.2. Alternating Updates

This setting is a common practical optimization for CFR algorithms.

The results appear in Figure 4, which covers Subgames 2 and 4, and Figure 5, which covers Subgames 1 and 3:

Figure 4. Convergence rate, with iterations on the x-axis and exploitability (saddle point gap, in mbb/g) on the y-axis, for Subgames 2 and 4. All algorithms use alternating updates.

Figure 5. Convergence rate, with iterations on the x-axis and exploitability (saddle point gap, in mbb/g) on the y-axis, for Subgames 1 and 3. All algorithms use alternating updates.

  • Overall Performance of OFTRL: In the alternating-updates setting, OFTRL generally performs worse relative to CFR and DCFR compared to the simultaneous update scenario.

    • In Subgame 1, OFTRL theory slightly outperforms CFR.
    • In Subgame 2, their performance is "near-identical".
    • In Subgames 3 and 4, even the manually-tuned OFTRL tuned variant performs worse than CFR. The authors suspect that a better choice of step size parameter might improve this.
  • DCFR vs. all: DCFR performs "significantly better than all other algorithms" in the alternating setting. This further cements DCFR's position as the strongest practical performer. The advantage of DCFR is more pronounced with alternating updates.

    Summary for Alternating Updates: While alternating updates generally boost the performance of CFR-like algorithms, OFTRL does not gain as much benefit relative to CFR or DCFR. DCFR remains dominant, showing remarkable practical speed.

6.2. Data Presentation (Tables)

The paper presents its results primarily through graphs rather than tables. The figures themselves are the core data presentation.

  • Figures 2 & 3 show Exploitability (mbb/g) on the y-axis vs. Iterations on the x-axis for simultaneous updates.

  • Figures 4 & 5 show Exploitability (mbb/g) on the y-axis vs. Iterations on the x-axis for alternating updates.

    The general trend observed in all graphs is that exploitability decreases as the number of iterations increases, indicating convergence towards a Nash equilibrium. The slope of the curve indicates the rate of convergence. A steeper negative slope (faster decrease) means faster convergence.

The key observation from these graphs, as discussed above, is the relative ordering of algorithms:

  1. DCFR is consistently the fastest in practice, often by a large margin.
  2. OFTRL (tuned) or OFTRL (theory) can be faster than CFR in some scenarios, but not universally, and often requires tuning.
  3. CFR is the slowest among the tested algorithms in most cases.

6.3. Ablation Studies / Parameter Analysis

The paper does not explicitly present ablation studies in the traditional sense (e.g., removing components of OFTRL to see their impact). However, it does include a crucial parameter analysis through the comparison of OFTRL theory and OFTRL tuned.

  • OFTRL theory vs. OFTRL tuned:
    • Impact of Step Size: The OFTRL theory variant uses the step size ($\eta = \kappa_j$) derived directly from the theoretical guarantees. The results show that this theoretically correct step size can be "much too conservative," especially in larger subgames (Subgames 1 and 2 under simultaneous updates). This leads to very slow practical convergence.
    • Empirical Tuning: By manually tuning the step size (e.g., dividing by 10, 100, or 1000), OFTRL tuned achieved significantly better performance than OFTRL theory and even surpassed CFR in some cases.
    • Conclusion: This highlights a critical practical challenge: theoretically derived parameters, while guaranteeing worst-case bounds, may not be optimized for average-case or practical performance. Finding optimal step sizes empirically can dramatically improve performance, even if it deviates from the strict theoretical settings. This suggests a potential area for future research: how to derive less conservative or adaptive step sizes that maintain theoretical guarantees while improving practical speed.
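
A minimal sketch of the coarse grid search described in the bullets above, assuming a hypothetical run_solver(eta, iterations) callback that runs the OFTRL variant with step size eta and returns the final exploitability:

```python
def tune_step_size(eta_theory, run_solver, iterations=1000):
    """Divide the theoretical step size by powers of ten and keep the value
    that yields the lowest exploitability after a fixed iteration budget."""
    best_eta, best_gap = eta_theory, float("inf")
    for divisor in (1, 10, 100, 1000):
        eta = eta_theory / divisor
        gap = run_solver(eta, iterations)
        if gap < best_gap:
            best_eta, best_gap = eta, gap
    return best_eta
```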

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper marks a significant theoretical advancement in the field of extensive-form game (EFG) solving. It presents the first Counterfactual Regret Minimization (CFR) variant that breaks the long-standing $O(T^{-1/2})$ theoretical convergence barrier, achieving an $O(T^{-3/4})$ rate. This is accomplished by meticulously integrating optimistic regret minimizers (specifically Optimistic Follow-the-Regularized-Leader, OFTRL) into the CFR framework. The core innovations include defining a new notion of stable-predictivity for regret minimizers and developing a hierarchical scheme for setting their stability parameters relative to their position in the decision tree. This work begins to bridge the theoretical gap between CFR and faster first-order methods. While experiments show that the proposed OFTRL variant can outperform vanilla CFR, it does not yet match the empirical speed of the state-of-the-art Discounted CFR (DCFR) in practice, despite DCFR's weaker worst-case theoretical guarantee.

7.2. Limitations & Future Work

The authors identify several key limitations and suggest future research directions:

  • Gap between Theory and Practice: The primary limitation is the observation that DCFR (and CFR+), despite having a theoretical worst-case convergence rate of $O(T^{-1/2})$, empirically performs better than OFTRL, which has a superior $O(T^{-3/4})$ theoretical rate. This suggests that the worst-case analysis might not fully capture the typical performance on many practical games of interest.
  • Sensitivity to Step Size: The theoretical step size for OFTRL was found to be too conservative in larger games, requiring manual tuning for better practical performance. This indicates a need for more robust or adaptive parameter selection strategies.
  • Future Work: The authors propose that future research should focus on finding variants of their algorithm that can still satisfy the theoretical guarantees while performing even better in practice. This could involve exploring different regularizers, prediction strategies, or more sophisticated adaptive step-size rules.

7.3. Personal Insights & Critique

  • Theoretical Breakthrough vs. Practical Reality: The paper's achievement of an $O(T^{-3/4})$ rate within CFR is a strong theoretical result. It demonstrates that the CFR family is not inherently limited to the square-root convergence rate. This is crucial for advancing the theoretical understanding of CFR and its potential. However, the experimental results strikingly highlight the persistent gap between theoretical worst-case guarantees and average-case empirical performance. DCFR's continued dominance in practice, despite its theoretically slower rate, suggests that factors beyond asymptotic worst-case bounds (e.g., constant factors, specific game properties, numerical stability, implicit regularizations) play a significant role in real-world efficiency.
  • Elegance of Inductive Proof: The inductive proof structure, leveraging the recursive decomposition of CFR and the new stable-predictivity property, is quite elegant. It provides a clear blueprint for how to analyze and potentially accelerate other hierarchical regret minimization algorithms. The careful cascading of stability parameters ($\gamma_v$ and $\kappa_j$) down the game tree is a clever mechanism to ensure overall algorithm stability.
  • The Challenge of Parameter Tuning: The necessity for manual tuning of step sizes for OFTRL to perform well in larger games is a practical hurdle. This suggests that the theoretical bounds, while valid, might be loose for specific problem instances or that the constants $\alpha, \beta$ in the stable-predictivity definition are large with the chosen parameters. Developing adaptive, theoretically-grounded step-size rules that are also empirically efficient remains a major challenge for many online learning algorithms, including this one.
  • Potential for Cross-Domain Application: The concept of stable-predictivity and the methodology for composing regret minimizers for sequential decision processes (Lemma 1, Lemma 2) could potentially be transferred or adapted to other domains involving hierarchical decision-making under uncertainty, such as partially observable Markov decision processes (POMDPs) or complex control systems, where local decision-makers need to coordinate their learning.
  • Future Value: This paper lays important groundwork. While DCFR is currently faster, the theoretical acceleration of CFR itself is a foundational step. Future work might combine the insights of DCFR (like discounting) with optimistic learning and stable-predictivity to achieve both theoretical speedups and state-of-the-art practical performance. The rigorous analysis provides a solid basis for further research into accelerated CFR variants.
