Paper status: completed

Dynamic Discounted Counterfactual Regret Minimization (CFR) is a family of iterative algorithms showing promising results in solving imperfect-information games. Recent novel CFR variants (e.g., CFR+, DCFR) have significantly improved the convergence rate of the vanilla CFR. The key to these CFR variants’ performance is weighting each iteration non-uniformly, i.e., discounting earlier iterations. However, these algorithms use a fixed, manually-specified scheme to weight each iteration, which eno

Dynamic Discounted Counterfactual Regret Minimization (1)Policy Optimization in Markov Decision Processes (1)Solving Imperfect-Information Games (1)Improving Convergence of Iterative Algorithms (1)Dynamic Weight Learning (1)

Original Link

Price: 0.100000

5 readers

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces Dynamic Discounted CFR (DDCFR), the first framework to dynamically learn discounting for iterations. By formalizing the CFR process as a Markov decision process, DDCFR achieves faster convergence and high performance, significantly improving efficiency in so

Abstract

Hang Xu, Kai Li, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng Counterfactual regret minimization (CFR) is a family of iterative algorithms showing promising results in solving imperfect-information games. Recent novel CFR variants (e.g., CFR+, DCFR) have significantly improved the convergence rate of the vanilla CFR. The key to these CFR variants’ performance is weighting each iteration non-uniformly, i.e., discounting earlier iterations. However, these algorithms use a fixed, manually-specified scheme to weight each iteration, which enormously limits their potential. In this work, we propose Dynamic Discounted CFR (DDCFR), the first equilibrium-finding framework that discounts prior iterations using a dynamic, automatically-learned scheme. We formalize CFR’s iteration process as a carefully designed Markov decision process and transform the discounting scheme learning problem into a policy optimization problem within it. The learned discounting scheme dynamically weights each iteration on the fly using information available at runtime. Extensive experimental results demonstrate that DDCFR’s dynamic discounting scheme has a strong generalization ability and leads to faster convergence with improved performance. The code is available at https://github.com/rpSebastian/DDCFR.

Mind Map

In-depth Reading

English Analysis~36 min read · 46,736 chars

1. Bibliographic Information

1.1. Title

The central topic of the paper is "Dynamic Discounted COunterfActuaL RegreT MInimIZatiON (DDCFR)," which introduces a novel framework for learning dynamic discounting schemes in Counterfactual Regret Minimization (CFR) algorithms, aimed at solving imperfect-information games more efficiently.

1.2. Authors

The authors are:

Hang Xu (Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
Kai Li (Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
Haobo Fu (Tencent AI Lab)
Qiang Fu (Tencent AI Lab)
Junliang Xing (Tsinghua University)
Jian Cheng (Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; AiRiA)

1.3. Journal/Conference

The paper was published in the proceedings of a top-tier Artificial Intelligence conference, likely the AAAI Conference on Artificial Intelligence or a similar venue, given the nature of the work and citations to other AAAI papers. Such venues are highly reputable and influential in the AI research community, particularly for topics in game theory, multi-agent systems, and reinforcement learning.

1.4. Publication Year

The publication year is not explicitly stated in the provided text snippet, but based on the references (e.g., Xu et al., 2022 is cited), it is likely 2023 or later.

1.5. Abstract

Counterfactual Regret Minimization (CFR) algorithms are iterative methods widely used to solve imperfect-information games. While recent CFR variants like $CFR+$ and DCFR have significantly improved convergence rates by non-uniformly weighting (discounting) earlier iterations, they rely on fixed, manually-specified discounting schemes. This limitation restricts their full potential.

This paper proposes Dynamic Discounted CFR (DDCFR), a novel framework that learns a dynamic, automatically-adjusted discounting scheme. The core methodology involves formalizing the CFR iteration process as a Markov Decision Process (MDP) and transforming the discounting scheme learning into a policy optimization problem within this MDP. The learned scheme dynamically adjusts iteration weights at runtime using available information. Extensive experiments demonstrate that DDCFR's dynamic discounting scheme offers strong generalization capabilities, leading to faster convergence and improved performance across various games, including those not seen during training.

1.6. Original Source Link

The original source link is /files/papers/6913f6f32db53125aacc49fd/paper.pdf. This appears to be a direct link to the PDF file, likely from a conference or institutional repository, indicating it is an officially published paper or a publicly available preprint.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by the paper is the inefficiency and suboptimality of fixed, manually-specified discounting schemes in existing Counterfactual Regret Minimization (CFR) variants for solving imperfect-information games.

This problem is important because imperfect-information games (IIGs) model strategic interactions where players lack complete knowledge of the game state or opponents' private information, which is omnipresent in real-world decision-making. Solving these games is theoretically and practically crucial for developing robust artificial intelligence. CFR and its variants are among the most successful approaches for computing Nash equilibria in IIGs, underpinning breakthroughs in games like poker. However, current CFR variants, despite their advancements (e.g., $CFR+$ , DCFR), rely on static discounting rules (e.g., DCFR's fixed $\alpha, \beta, \gamma$ hyperparameters).

The specific challenges or gaps in prior research include:

Suboptimality: A fixed discounting scheme, manually chosen, cannot be optimal for all games or even across different phases of the same game, as there are limitless theoretical schemes.
Lack of Flexibility: Pre-determined schemes cannot adapt to runtime information or changing conditions during the iterative learning process, potentially hindering convergence.
Scalability: Manually tuning hyperparameters for diverse IIGs is impractical and resource-intensive.

The paper's entry point and innovative idea is to introduce a dynamic, automatically-learned discounting scheme for CFR algorithms. Instead of relying on human intuition or extensive grid searches, DDCFR transforms the problem of finding optimal discounting weights into a policy optimization problem within a Markov Decision Process (MDP), allowing the weights to be adjusted on the fly based on the current state of the iteration process.

2.2. Main Contributions / Findings

The paper makes three primary novel contributions:

Proposed DDCFR Framework: It introduces DDCFR, the first equilibrium-finding framework that utilizes a dynamic, automatically-learned scheme to discount prior iterations. This moves beyond fixed, manually-specified discounting rules, offering greater flexibility and potential for improved performance.
MDP Formalization for Discounting Scheme Learning: The paper formalizes the iterative process of CFR as a carefully designed Markov Decision Process (MDP). This transformation allows the problem of learning an effective discounting scheme to be framed as a policy optimization task within the MDP, where an agent observes the state of the CFR process and outputs dynamic discounting weights.
Effective Training Algorithm and Generalization: It designs an effective training algorithm based on Evolution Strategies (ES) to optimize the discounting policy. The learned scheme demonstrates strong generalization ability, achieving competitive performance on new games not seen during training, significantly outperforming fixed discounting schemes on both training and testing games.

The key conclusion is that dynamic, automatically-learned discounting schemes, which adapt to runtime information, lead to faster convergence and improved performance in solving imperfect-information games compared to static, manually-specified schemes. This approach provides a performance booster that can be applied to various existing CFR variants.

3.1. Foundational Concepts

To understand the DDCFR framework, it is crucial to grasp several foundational concepts in game theory and artificial intelligence:

3.1.1. Imperfect-Information Games (IIGs)

An imperfect-information game (IIG) is a type of game where players do not have complete knowledge about the state of the game or the actions taken by other players. This contrasts with perfect-information games (like chess) where all players know everything that has occurred. In IIGs, players often have private information (e.g., hidden cards in poker) and must make decisions under uncertainty. IIGs model many real-world scenarios, making their study highly relevant.

3.1.2. Two-Player Zero-Sum Games

A two-player zero-sum game is a type of game involving two players where one player's gain directly corresponds to the other player's loss, meaning their utilities sum to zero (or a constant). In such games, there is a direct conflict of interest, and finding optimal strategies often involves reaching a balance where neither player can unilaterally improve their outcome. The paper focuses on solving two-player zero-sum IIGs.

3.1.3. Nash Equilibrium

A Nash equilibrium (Nash, 1950) is a stable state in a game where no player can improve their outcome by unilaterally changing their strategy, assuming the other players' strategies remain unchanged. It represents a set of strategies, one for each player, such that each player's strategy is an optimal response to the strategies of the others. The goal of solving IIGs is often to find an (approximate) Nash equilibrium.

3.1.4. Exploitability

Exploitability is a measure of how far a given strategy profile is from a Nash equilibrium. For a strategy profile $\sigma$ , the exploitability $e(\sigma)$ quantifies the maximum expected utility gain an opponent could achieve by switching to their best response, assuming the other players' strategies remain fixed. In a Nash equilibrium, the exploitability is zero. An $\epsilon$ -Nash equilibrium is a strategy profile where no player has an exploitability higher than $\epsilon$ . The paper uses exploitability as the primary metric for convergence and performance. The paper defines exploitability of a strategy profile $\bar{\sigma}$ as: $ e ( \bar { \sigma } ) = \sum _ { i \in \mathcal { N } } e _ { i } \bar { ( } \sigma _ { i } \bar { ) } / | { \mathcal { N } } | $ where $e_i(\bar{\sigma}_i)$ represents the exploitability of player $i$ 's strategy. More specifically, for player $i$ , the exploitability $e_i(\sigma_i)$ of their strategy $\sigma_i$ against other players' strategies $\sigma_{-i}$ is defined as the difference between the maximum utility player $i$ could get by playing a best response and the utility they get from playing $\sigma_i$ : $ e_i(\sigma_i) = u_i(\mathrm{BR}(\sigma_{-i}), \sigma_{-i}) - u_i(\sigma_i, \sigma_{-i}) $ where $\mathrm{BR}(\sigma_{-i})$ is the best response strategy for player $i$ given $\sigma_{-i}$ .

3.1.5. Extensive-Form Games

Extensive-form games (Osborne & Rubinstein, 1994) are a tree-based formalism used to describe IIGs.

Game Tree: Represents all possible sequences of actions.
Histories ( $h$ ): Sequences of actions taken from the start of the game. Terminal histories ( $\mathcal{Z}$ ) are those where no further actions are available.
Information Sets ( $I$ ): For a player $i$ , an information set $I \in \mathcal{T}_i$ is a set of histories that player $i$ cannot distinguish between. If player $i$ is to move at any history in $I$ , they cannot tell which specific history in $I$ they are at. Therefore, their strategy must be the same for all histories within that information set.
Strategy ( $\sigma_i(I)$ ): A probability distribution over available actions for player $i$ at an information set $I$ . $\sigma_i(I, a)$ is the probability of taking action $a$ .
Utility Function ( $u_i(z)$ ): Assigns a payoff to player $i$ for each terminal history $z$ .
Reach Probabilities ( $\pi^\sigma(h)$ ): The joint probability of reaching a history $h$ if all players play according to strategy profile $\sigma$ . This can be decomposed into contributions from player $i$ ( $\pi_i^\sigma(h)$ ) and all other players ( $\pi_{-i}^\sigma(h)$ ). Similarly, information set reach probability $\pi^\sigma(I)$ is the sum of reach probabilities of histories in $I$ .
Counterfactual Value ( $v_i^\sigma(I)$ ): The expected utility for player $i$ at an information set $I$ , assuming player $i$ tries to reach it, and other players play according to $\sigma_{-i}$ . The counterfactual value of taking a specific action $a$ in $I$ is $v_i^\sigma(I, a)$ .

3.1.6. Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker (agent). An MDP is defined by:

States ( $S$ ): The possible situations an agent can be in.
Actions ( $A$ ): The choices available to the agent in each state.
Transition Function ( $P^G$ ): A probability distribution over the next state given the current state and action. In the context of DDCFR, this describes how the CFR iteration process evolves.
Reward Function ( $\hat{R}^G$ ): A value received by the agent after taking an action, indicating the desirability of that action's outcome. The goal of the agent is to maximize its cumulative reward.

3.1.7. Evolution Strategies (ES)

Evolution Strategies (ES) (Wierstra et al., 2014; Salimans et al., 2017) are a class of black-box optimization algorithms inspired by biological evolution. They are used to optimize objective functions without requiring gradient information. Instead, ES works by:

Sampling: Generating a population of perturbed versions of the model parameters (e.g., by adding Gaussian noise).
Evaluation: Evaluating the objective function (fitness) for each perturbed parameter set.
Update: Updating the main model parameters based on the fitness of the perturbed versions, typically by moving towards the direction of higher fitness. ES is known for its scalability, ease of parallelization, and robustness to sparse or noisy rewards, making it suitable for problems where traditional gradient-based methods are challenging.

3.2. Previous Works

The paper builds upon and differentiates itself from several key CFR variants and related optimization techniques:

3.2.1. Counterfactual Regret Minimization (CFR)

Introduced by Zinkevich et al. (2007), CFR is a fundamental iterative algorithm for finding Nash equilibria in extensive-form IIGs. It works by iteratively traversing the game tree, calculating instantaneous regrets for not choosing certain actions, accumulating these regrets, and then updating strategies using regret-matching (Hart & Mas-Colell, 2000; Cesa-Bianchi & Lugosi, 2006). The time-averaged strategy profile converges to a Nash equilibrium.

Instantaneous Regret ( $r_i^t(I, a)$ ): The difference between the counterfactual value of taking action $a$ in information set $I$ and the counterfactual value of playing the current strategy in $I$ . $ \boldsymbol { r } _ { i } ^ { t } ( \boldsymbol { I } , \boldsymbol { a } ) = \boldsymbol { v } _ { i } ^ { \sigma ^ { t } } ( \boldsymbol { I } , \boldsymbol { a } ) - \boldsymbol { v } _ { i } ^ { \sigma ^ { t } } ( \boldsymbol { I } ) $ where $\boldsymbol { v } _ { i } ^ { \sigma ^ { t } } ( \boldsymbol { I } , \boldsymbol { a } )$ is the counterfactual value of action $a$ in information set $I$ under strategy $\sigma^t$ , and $\boldsymbol { v } _ { i } ^ { \sigma ^ { t } } ( \boldsymbol { I } )$ is the counterfactual value of information set $I$ under strategy $\sigma^t$ .
Cumulative Regret ( $R_i^t(I, a)$ ): The sum of instantaneous regrets for an action $a$ in an information set $I$ over all iterations up to $t$ . $ R _ { i } ^ { t } ( I , a ) = \sum _ { k = 1 } ^ { t } r _ { i } ^ { k } ( I , a ) $
Regret-Matching: The strategy for the next iteration $\sigma_i^{t+1}(I,a)$ is computed based on positive cumulative regrets: $ \begin{array} { r } { \sigma _ { i } ^ { t + 1 } ( I , a ) = \left{ \begin{array} { l l } { \frac { R _ { i } ^ { t , + } ( I , a ) } { \sum _ { a ^ { \prime } \in A ( I ) } R _ { i } ^ { t , + } ( I , a ^ { \prime } ) } , } & { \mathrm { i f ~ } \sum _ { a ^ { \prime } \in A ( I ) } R _ { i } ^ { t , + } \left( I , a ^ { \prime } \right) > 0 } \ { \frac { 1 } { | A ( I ) | } , } & { \mathrm { o t h e r w i s e } , } \end{array} \right. } \end{array} $ where $R_i^{t,+}(I, a) = \max(R_i^t(I, a), 0)$ . This means only actions with positive regret (i.e., actions that would have been better to take) contribute to the future strategy.
Average Strategy ( $\bar{\sigma}_i^t(I,a)$ ): The overall strategy that converges to a Nash equilibrium is the average of strategies from all iterations, weighted by their reach probabilities: $ C _ { i } ^ { t } ( I , a ) = \sum _ { k = 1 } ^ { t } \left( \pi _ { i } ^ { \sigma ^ { k } } ( I ) \sigma _ { i } ^ { k } ( I , a ) \right) , ~ \bar { \sigma } _ { i } ^ { t } ( I , a ) = \frac { C _ { i } ^ { t } ( I , a ) } { \sum _ { a ^ { \prime } \in A ( I ) } C _ { i } ^ { t } ( I , a ^ { \prime } ) } , $ where $\pi_i^{\sigma^k}(I)$ is the information set reach probability for player $i$ using strategy $\sigma^k$ .

3.2.2. CFR+

$CFR+$ (Tammelin, 2014) significantly improved vanilla CFR's convergence rate. Its key modifications include:

Linear Discounting: Weights iteration $t$ by $t$ rather than uniformly, giving more importance to recent iterations.
Regret Resetting: Sets any action with negative cumulative regret to zero on each iteration, allowing for immediate reuse of promising actions.
Alternating Updates: Players update their strategies in an alternating fashion.

3.2.3. Discounted CFR (DCFR)

DCFR (Brown & Sandholm, 2019b) is a family of algorithms that generalizes discounting schemes, further accelerating convergence. It discounts both cumulative regrets and the cumulative strategy using three hyperparameters: $\alpha$ , $\beta$ , and $\gamma$ .

Discounted Cumulative Regret Update: $ R _ { i } ^ { t } ( I , a ) = \left{ \begin{array} { l l } { R _ { i } ^ { t - 1 } ( I , a ) \frac { ( t - 1 ) ^ { \alpha } } { ( t - 1 ) ^ { \alpha } + 1 } + r _ { i } ^ { t } ( I , a ) , } & { \mathrm { i f ~ } R _ { i } ^ { t - 1 } ( I , a ) > 0 } \ { R _ { i } ^ { t - 1 } ( I , a ) \frac { ( t - 1 ) ^ { \beta } } { ( t - 1 ) ^ { \beta } + 1 } + r _ { i } ^ { t } ( I , a ) , } & { \mathrm { o t h e r w i s e } , } \end{array} \right. $ Here, positive cumulative regrets are multiplied by $\frac{(t-1)^\alpha}{(t-1)^\alpha + 1}$ and negative cumulative regrets by $\frac{(t-1)^\beta}{(t-1)^\beta + 1}$ before adding the current instantaneous regret $r_i^t(I,a)$ .
Discounted Cumulative Strategy Update: $ C _ { i } ^ { t } ( I , a ) = C _ { i } ^ { t - 1 } ( I , a ) \left( \frac { t - 1 } { t } \right) ^ { \gamma } + \pi _ { i } ^ { \sigma ^ { t } } ( I ) \sigma _ { i } ^ { t } ( I , a ) . $ The previous cumulative strategy $C_i^{t-1}(I,a)$ is discounted by $\left(\frac{t-1}{t}\right)^\gamma$ before adding the current strategy contribution. DCFR empirically uses fixed values like $\alpha = 1.5$ , $\beta = 0$ , and $\gamma = 2$ .

3.2.4. Predictive CFR+ (PCFR+) and PDCFR

Predictive CFR+ (Farina et al., 2021) is a state-of-the-art CFR variant based on predictive Blackwell approachability. It also uses a fixed quadratic discounting scheme for the average strategy, similar to the $\gamma$ term in DCFR (e.g., $\left(\frac{t-1}{t}\right)^\gamma$ with $\gamma=2$ ). PDCFR is a combination of $PCFR+$ and DCFR, also using fixed discounting.

3.2.5. Greedy Weights

Greedy Weights (Zhang et al., 2022) is a recently proposed algorithm for normal-form games that greedily weighs iterates based on observed regrets. While adaptable to extensive-form games, the paper finds its performance inferior to DDCFR in that domain.

3.3. Technological Evolution

The field of solving imperfect-information games has seen a steady progression in CFR algorithms:

Vanilla CFR (2007): Established the foundational iterative regret minimization approach with uniform weighting of iterations. Provided theoretical convergence guarantees.
CFR+ (2014): Introduced crucial practical improvements, notably linear discounting and regret resetting, drastically speeding up convergence and enabling solutions for games like heads-up limit Texas Hold'em. This showed the power of non-uniform weighting.
Discounted CFR (DCFR) (2019): Systematically investigated discounting schemes, generalizing the concept with $\alpha, \beta, \gamma$ hyperparameters. This demonstrated further significant convergence acceleration, but still relied on fixed, manually-tuned parameters.
Predictive CFR+ (PCFR+) (2021): Incorporated concepts from Blackwell approachability, offering another avenue for speedup, again with fixed discounting rules.
Dynamic Discounted CFR (DDCFR) (Current Paper): This work represents the next evolutionary step by addressing the primary limitation of prior discounting variants: their fixed, manually-specified nature. DDCFR introduces a dynamic, automatically-learned discounting scheme, effectively turning the hyperparameter optimization problem into a policy optimization problem within an MDP, thus making the CFR process adaptive and potentially more robust and efficient across diverse game types.

3.4. Differentiation Analysis

The core difference and innovation of DDCFR compared to previous CFR variants like $CFR+$ and DCFR lie in the dynamic and automatically-learned nature of its discounting scheme.

Fixed vs. Dynamic Discounting:
- $CFR+$ uses a fixed linear discounting scheme.
- DCFR uses a family of fixed discounting schemes determined by manually chosen hyperparameters ( $\alpha, \beta, \gamma$ ). While these parameters allow for different schemes, the chosen values are static throughout the game and across different games.
- DDCFR (and its framework applied to $PCFR+$ and PDCFR), in contrast, learns a policy that dynamically adjusts the discounting weights (represented by $\alpha_t, \beta_t, \gamma_t, \tau_t$ ) on the fly at each iteration $t$ . This adjustment is based on runtime information about the CFR process, such as the current iteration number and exploitability.
Manual vs. Automatic Learning:
- Previous variants require manual specification or empirical tuning of their discounting parameters (e.g., DCFR's $\alpha=1.5, \beta=0, \gamma=2$ were empirically found to be best). This process is unscalable, suboptimal, and requires domain expertise for each new game.
- DDCFR automatically learns this scheme by formulating the problem as a policy optimization within a Markov Decision Process (MDP). An agent (neural network) learns to output optimal discounting parameters based on the observed state.
Generalization vs. Game-Specific Tuning:
- Manually tuned schemes are often optimized for specific games or a limited set of games and may not generalize well to unseen games.
- DDCFR is explicitly designed for generalization. By training on a diverse set of games and using a game-agnostic state representation, the learned policy can effectively adapt to new games without further modification or tuning.
Optimization Approach:
- Traditional hyperparameter optimization methods (grid search, Bayesian optimization) find the best fixed values.
- DDCFR uses Evolution Strategies (ES) to train a neural network policy that outputs dynamic weights. This is a novel application of ES to directly learn an adaptive CFR component.
  
  In essence, DDCFR provides a meta-learning approach to CFR, where the algorithm itself learns how to best configure its iterative process, rather than relying on fixed rules.

4. Methodology

4.1. Principles

The core idea behind Dynamic Discounted CFR (DDCFR) is to move beyond fixed, manually-specified discounting schemes in Counterfactual Regret Minimization (CFR) algorithms by learning a dynamic, adaptive scheme. The theoretical basis is that CFR's iterative process can be framed as a sequential decision-making problem, which is naturally modeled as a Markov Decision Process (MDP). Within this MDP, an intelligent agent (represented by a neural network) can learn a policy to choose optimal discounting weights at each iteration, maximizing the efficiency of convergence to a Nash equilibrium. The intuition is that different phases of the learning process (e.g., early vs. late iterations) or different game characteristics might benefit from different discounting strategies, and a dynamic scheme can adapt to these nuances.

4.2. Core Methodology In-depth (Layer by Layer)

The DDCFR framework encapsulates the CFR iteration process into an environment and models the discounting scheme as an agent interacting with it. This interaction constitutes an MDP.

4.2.1. The DDCFR Framework Overview

The agent (the dynamic discounting scheme) receives the current state of the CFR iteration process and outputs an action, which consists of the discounting weights to be used for a certain number of subsequent CFR iterations. This process repeats until a maximum number of iterations is reached. The agent's ultimate goal is to learn an optimal discounting policy that can dynamically select appropriate weights, thereby minimizing the exploitability of the obtained average strategies and finding an approximate Nash equilibrium promptly.

The DDCFR framework can be visualized as follows, illustrating the overall iterative process:

Figure 1: The DDCFR framework.
该图像是示意图，展示了动态折扣反事实遗憾最小化（DDCFR）框架的工作原理。图中包含多个组件，如迭代归一化、状态、动作、以及环境中的博弈 $G$ 。各部分颜色编码表示了当前策略、瞬时遗憾、累积遗憾更新、策略匹配等关键步骤，以及奖励信号的反馈机制。整体框架表明如何动态调整策略以提高收敛速度和性能。

Figure 1: The DDCFR framework.

4.2.2. MDP Formalization

The interaction process between the agent and the CFR environment for a given game $G$ is formalized as a Markov Decision Process (MDP), represented by $( G , S , A , P ^ { G } , \hat { R } ^ { G } )$ .

4.2.2.1. The Game $G$

Each game $G$ acts as a distinct environment for the MDP. The ultimate objective is to train a discounting policy that not only performs well on the training games but also generalizes effectively to new, unseen games.

4.2.2.2. The State Space $S$

The state $s_t \in S$ represents the information observed by the agent at iteration $t$ within game $G$ . To ensure the learned policy's generalization ability across different games, the state representation is designed to be game-agnostic. It consists of two key components:

Normalized Iteration ( $\hat{t}$ ): This is the ratio of the current iteration $t$ to the total number of iterations $T$ . It is calculated as $\hat{t} = t/T$ . This provides the agent with a sense of the progress within the overall CFR run.
Normalized Exploitability ( $\hat{E}_{t-1}^G$ ): This is a normalized logarithmic measure of the exploitability of the average strategy from the previous iteration t-1. It is calculated as: $ \hat { E } _ { t - 1 } ^ { G } = \frac { \log E _ { t - 1 } ^ { G } - \log E _ { \operatorname* { m i n } } } { \log E _ { 1 } ^ { G } - \log E _ { \operatorname* { m i n } } } $ where:
- $E_{t-1}^G$ is the exploitability of the average strategy in game $G$ at iteration t-1.
- $E_1^G$ is the exploitability at iteration 1.
- $E_{\mathrm{min}}$ is the minimum reachable exploitability, which is set to $10^{-12}$ to avoid issues with $\log(0)$ and provide a stable lower bound. The logarithmic operation is crucial here because exploitability values in CFR algorithms often decrease rapidly in early iterations and then slow down significantly. Logarithmic normalization ensures a more even distribution of exploitability values across the state space, which facilitates the neural network agent's learning process.

4.2.2.3. The Action Space $A$

The agent's action $\hat{a}_t \in A$ at iteration $t$ is a vector of four numbers: $\hat{a}_t = [ \alpha_t, \beta_t, \gamma_t, \tau_t ]$ . These parameters are directly used to determine the discounting weights for the subsequent CFR iterations, similar to DCFR (Brown & Sandholm, 2019b).

The discounting weights are calculated as follows:

Positive Cumulative Regrets Discounting Factor: The previous positive cumulative regrets $R_i^{t-1}(I,a)$ are multiplied by $\frac{(t-1)^{\alpha_t}}{(t-1)^{\alpha_t}+1}$ before adding the instantaneous regret $r_i^t(I,a)$ . This leads to the update rule: $ R _ { i } ^ { t } ( I , a ) = R _ { i } ^ { t - 1 } ( I , a ) \frac { ( t - 1 ) ^ { \alpha _ { t } } } { ( t - 1 ) ^ { \alpha _ { t } } + 1 } + r _ { i } ^ { t } ( I , a ) \quad \mathrm { i f ~ } R _ { i } ^ { t - 1 } ( I , a ) > 0 $ Here, $\alpha_t$ controls how aggressively positive regrets are discounted. A smaller $\alpha_t$ means stronger discounting of older regrets.
Negative Cumulative Regrets Discounting Factor: The previous negative cumulative regrets $R_i^{t-1}(I,a)$ are multiplied by $\frac{(t-1)^{\beta_t}}{(t-1)^{\beta_t}+1}$ before adding the instantaneous regret $r_i^t(I,a)$ . This leads to the update rule: $ R _ { i } ^ { t } ( I , a ) = R _ { i } ^ { t - 1 } ( I , a ) \frac { ( t - 1 ) ^ { \beta _ { t } } } { ( t - 1 ) ^ { \beta _ { t } } + 1 } + r _ { i } ^ { t } ( I , a ) \quad \mathrm { o t h e r w i s e } $ Here, $\beta_t$ controls the discounting of negative regrets.
Cumulative Strategy Discounting Factor: The previous cumulative strategy $C_i^{t-1}(I,a)$ is discounted by $\left(\frac{t-1}{t}\right)^{\gamma_t}$ before adding the current iteration's strategy contribution. This leads to the update rule: $ C _ { i } ^ { t } ( I , a ) = C _ { i } ^ { t - 1 } ( I , a ) \left( \frac { t - 1 } { t } \right) ^ { \gamma _ { t } } + \pi _ { i } ^ { \sigma ^ { t } } ( I ) \sigma _ { i } ^ { t } ( I , a ) . $ Here, $\gamma_t$ controls how aggressively older strategies are discounted in the average strategy calculation. A larger $\gamma_t$ means stronger discounting of older strategies.
Duration ( $\tau_t$ ): This parameter represents the number of subsequent CFR iterations for which the currently chosen discounting weights ( $\alpha_t, \beta_t, \gamma_t$ ) will be applied. Incorporating $\tau_t$ allows the policy to decide how frequently to adjust the hyperparameters.

The key distinction from DCFR is that DDCFR's $\alpha_t, \beta_t, \gamma_t$ are dynamically adjusted at runtime based on the state $s_t = [\hat{t}, \hat{E}_{t-1}^G]$ , rather than being fixed values throughout the entire iteration process or across different games.

4.2.2.4. The State Transition $P^G$

The state transition function $P^G: S \times A \to S$ describes how the CFR iteration process evolves. When the agent observes the current state $s_t$ and outputs an action $\hat{a}_t = [\alpha_t, \beta_t, \gamma_t, \tau_t]$ , DDCFR uses the calculated discounting weights for $\tau_t$ iterations. After these $\tau_t$ iterations, the state then transitions to $s_{t+\tau_t}$ . This allows the agent to control the granularity of its policy updates.

4.2.2.5. The Reward Function $\hat{R}^G$

The reward function $\hat{R}^G$ evaluates the agent's performance and guides the learning process. DDCFR uses a sparse reward setting, meaning the agent only receives a reward at the very end of the total CFR iterations ( $T$ ). The reward is defined as: $ \hat { R } ^ { G } = \log E _ { 1 } ^ { G } - \log E _ { T } ^ { G } $ where:

$E_1^G$ is the exploitability of the initial average strategy (at iteration 1) in game $G$ .
$E_T^G$ is the exploitability of the final average strategy (at the maximum number of iterations $T$ ) in game $G$ . The agent's objective is to maximize this reward, which corresponds to achieving the largest reduction in exploitability from the beginning to the end of the CFR process.

4.2.2.6. The Agent (Policy Network)

The agent, which represents the dynamic discounting scheme, is typically implemented as a neural network with parameters $\pmb{\theta}$ . This network, denoted as $\pi_{\pmb{\theta}}$ , takes the current state $s_t$ as input and outputs the action $\hat{a}_t$ : $ [ \alpha _ { t } , \beta _ { t } , \gamma _ { t } , \tau _ { t } ] = \hat { a } _ { t } = \pi _ { \pmb { \theta } } \big ( s _ { t } \big ) $ The overall objective for DDCFR is to learn parameters $\pmb{\theta}$ that maximize the average final reward across all training games $\mathbb{G}$ : $ f ( \pmb { \theta } ) = \frac { 1 } { | \mathbb { G } | } \sum _ { G \in \mathbb { G } } f ^ { G } ( \pmb { \theta } ) $ where $f^G(\pmb{\theta}) = \hat{R}^G$ for a specific game $G$ .

4.2.3. Theoretical Analysis

The paper provides a theoretical guarantee for DDCFR's convergence to a Nash equilibrium, provided the chosen hyperparameters stay within certain ranges.

Theorem 1. Assume that conduct DDCFR $T$ iterations in a two-player zero-sum game. If DDCFR selects hyperparameters as follows: $\alpha _ { t } \in [ 0 , 5 ]$ for $t { < } \frac { T } { 2 }$ and $\alpha _ { t } \in [ 1 , 5 ]$ for $t \geq \frac { T } { 2 }$ , $\beta _ { t } \in [ - 5 , 0 ]$ , $\gamma _ { t } \in [ 0 , 5 ]$ , then the weighted average strategy converges to a $\begin{array} { r } { 6 | \mathcal { T } | \Delta \left( \frac { 8 } { 3 } \sqrt { | \mathcal { A } | } + \frac { 2 } { \sqrt { T } } \right) / \sqrt { T } } \end{array}$ Nash equilibrium.

Explanation:

This theorem ensures that DDCFR, even with dynamic $\alpha_t, \beta_t, \gamma_t$ values, maintains the theoretical convergence properties characteristic of CFR algorithms.
The conditions on $\alpha_t$ (switching from [0, 5] to [1, 5] halfway through iterations) and the ranges for $\beta_t$ and $\gamma_t$ are crucial for the proof, which relies on bounding cumulative regrets and applying existing lemmas from DCFR and CFR literature.
The convergence rate is $\mathcal{O}(1/\sqrt{T})$ , which is typical for regret minimization algorithms, scaled by game-specific factors like the number of information sets ( $|\mathcal{T}|$ ), maximum actions ( $|\mathcal{A}|$ ), and utility range ( $\Delta$ ). This indicates that the algorithm will approach a Nash equilibrium, and the speed of approach depends on the square root of the number of iterations.

4.2.4. Optimization Through Evolution Strategies

Training the discounting policy $\pi_{\pmb{\theta}}$ presents several challenges:

Sparse Reward: The agent only receives a reward at the very end of the CFR iteration process, making it difficult for standard Reinforcement Learning (RL) algorithms to learn credit assignment over long horizons.
Long Time Horizons: CFR often requires thousands of iterations, leading to very long episodes in the MDP.
Multi-task Learning: The agent needs to generalize across various games (multi-task learning), which can be difficult for RL methods sensitive to hyperparameter tuning.

To overcome these challenges, the authors choose Evolution Strategies (ES) (Salimans et al., 2017) instead of traditional RL algorithms like DQN or PPO. ES is well-suited because it is:
Black-box Optimization: It does not require gradient information, making it robust to sparse or noisy rewards.
Scalable: Easy to parallelize across many CPU cores.
Robust: Tolerant of long time horizons and less sensitive to hyperparameters.

The ES-based training algorithm aims to estimate the gradient of the objective function $f(\pmb{\theta})$ (average reward across games) by applying Gaussian perturbations to the network parameters $\pmb{\theta}$ . The gradient estimation is given by: $ \nabla _ { \pmb { \theta } } \mathbb { E } _ { \epsilon \sim \mathcal { N } ( 0 , I ) } \bar { f } ( \pmb { \theta } + \delta \epsilon ) = \frac { 1 } { \delta } \mathbb { E } _ { \epsilon \sim \mathcal { N } ( 0 , I ) } f ( \pmb { \theta } + \delta \epsilon ) \epsilon $ where:
$\pmb{\theta}$ are the agent's parameters.
$\delta$ is the noise standard deviation, controlling the magnitude of perturbations.
$\epsilon \sim \mathcal{N}(0, I)$ is a random vector sampled from a standard Gaussian distribution.
$f(\pmb{\theta} + \delta\epsilon)$ is the objective function (average reward) evaluated with perturbed parameters.

This gradient is approximated by sampling a population of perturbed parameters, evaluating their performance, and then using these evaluations to update the original parameters via stochastic gradient ascent.

The complete training procedure for DDCFR's discounting policy is outlined in Algorithm 1:

Algorithm 1: DDCFR's training procedure.

Input: Training games $\mathbb { G }$ , epochs $M$ , learning rate lr, population size $N$ , standard deviation $\delta$
1 Randomly initialize the agent's parameters $\check { \pmb { \theta } } ^ { 1 }$ ;
2 for $m = 1$ to $M$ do
3     Initialize population $\mathbb { P }$ with an empty list;
4     for $i = 1 , \ldots , \frac { N } { 2 }$ do
5         Sample $\epsilon _ { i } \stackrel { \sim } { \sim } \mathcal { N } ( 0 , I )$ ;
6         Add $\epsilon _ { i } , - \epsilon _ { i }$ to the population $\mathbb { P }$ ;
7     foreach $\epsilon \in \mathbb { P }$ do
8         foreach $G \in \mathbb { G }$ do
9             Compute $f ^ { G } ( { \pmb \theta } ^ { m } { + } \delta { \pmb \epsilon } )$ , (Algorithm 2);
10          $\begin{array} { r } { f ( \pmb { \theta } ^ { m } + \delta \pmb { \epsilon } ) \gets \frac { 1 } { | \mathbb { G } | } \sum _ { G \in \mathbb { G } } f ^ { G } ( \pmb { \theta } ^ { m } + \delta \pmb { \epsilon } ) } \end{array}$
11         Compute $f ^ { \prime } ( \pmb { \theta } ^ { m } { + } \delta \pmb { \epsilon } )$ using Eqn. 5;
I2      $\begin{array} { r } { \pmb { \theta } ^ { m + 1 } \gets \pmb { \theta } ^ { m } + \frac { l r } { \delta \cdot N } \sum _ { \epsilon \in \mathbb { P } } f ^ { \prime } ( \pmb { \theta } ^ { m } + \delta \pmb { \epsilon } ) \pmb { \epsilon } } \end{array}$

Output: The trained agent $\pi _ { \pmb { \theta } ^ { M + 1 } }$

The calculation process of $f^G(\pmb{\theta})$ (the reward for a single game) is detailed in Algorithm 2:

Algorithm 2: The calculation process of $f ^ { G } ( \pmb { \theta } )$ .

Input: Training game $G$ , agent $\pi _ { \theta }$ , total iterations $T$
1 $t \gets 1$ ;
2 while $t \leq T$ do
3      $\left[ \alpha _ { t } , \beta _ { t } , \gamma _ { t } , \tau _ { t } \right] = \hat { a } _ { t } = \pi _ { \pmb { \theta } } ( s _ { t } )$ ;
4     Use $[ \alpha _ { t } , \beta _ { t } , \gamma _ { t } ]$ to calculate discounting weights and perform $\tau _ { t }$ numbers of CFR iterations;
5      $t \gets t + \tau _ { t }$ ;

6 Compute exploitability $E _ { 1 } ^ { G } , E _ { T } ^ { G }$ ; Output: $f ^ { G } ( { \pmb \theta } ) = \log E _ { 1 } ^ { G } - \log E _ { T } ^ { G }$

Three acceleration techniques are employed to improve training efficiency:

Antithetic Estimator: This technique reduces variance in the gradient estimation by sampling perturbations in pairs ( $\epsilon_i$ and $-\epsilon_i$ ). The gradient is calculated using the symmetric difference: $ \nabla _ { \theta } \mathbb { E } _ { \epsilon \sim \mathcal { N } ( 0 , I ) } f \big ( \theta + \delta \epsilon \big ) = \frac { 1 } { 2 \delta } \mathbb { E } _ { \epsilon \sim \mathcal { N } ( 0 , I ) } \left[ f \big ( \theta + \delta \epsilon \big ) - f \big ( \theta - \delta \epsilon \big ) \right] \dot { \epsilon } $ This provides an unbiased estimator with lower variance compared to using only positive perturbations.
Fitness Shaping: A rank transformation is applied to the objective function values $f(\pmb{\theta})$ before computing the gradient. For a parameter $\theta_k$ with the $k$ -th largest value in a population of size $N$ , the shaped result $f'(\theta_k)$ is: $ f ^ { \prime } ( \theta _ { k } ) = \frac { \operatorname* { m a x } \left( 0 , \log \left( \frac { N } { 2 } + 1 \right) - \log ( k ) \right) } { \sum _ { j = 1 } ^ { P } \operatorname* { m a x } \left( 0 , \log \left( \frac { N } { 2 } + 1 \right) - \log ( j ) \right) } - \frac { 1 } { N } . $ This approach makes the gradient dependent on the relative ordering of fitness values rather than their absolute magnitudes, increasing robustness to variations in reward scales.
Parallel Evaluation: The performance of each perturbed parameter set is evaluated under each training game in parallel across multiple CPU cores, significantly speeding up the overall training process.

4.2.5. Additional Computation Cost Analysis

DDCFR introduces minor additional computation costs compared to DCFR, primarily in three areas:

Feature Calculations: The state representation requires the current iteration number (effortlessly obtainable) and the normalized exploitability. The calculation of exploitability can be performed in parallel with the instantaneous regret calculation. This is possible because exploitability depends on the average strategy ( $\bar{\sigma}^{t-1}$ ) from the previous iteration, while instantaneous regret depends on the current strategy ( $\sigma^t$ ). By executing these two computations concurrently using multiprocessing, there is effectively no additional wall-clock time overhead.
Network Inference: The time overhead for the policy network $\pi_{\pmb{\theta}}$ to infer the discounting weights is negligible compared to the total CFR iteration time. Experimental results in Appendix D show DDCFR only incurs an average increase of $0.22\%$ in running time compared to DCFR.
Policy Training: The training cost for the discounting policy is amortized. Once trained, the policy can be applied directly to numerous new games without further training or modification, making the initial training effort worthwhile over a wide range of applications.

5. Experimental Setup

5.1. Datasets

The experiments use a selection of commonly used imperfect-information games (IIGs) from the research community. These games vary in complexity and size, providing a diverse testbed for evaluating DDCFR's performance and generalization ability.

Training Games: The training games are chosen to be relatively small to speed up the training process while still exhibiting typical characteristics found in larger IIGs.

Kuhn Poker (Kuhn, 1950): A simplified poker game with a 3-card deck (J, Q, K), one bet chance per player.
Goofspiel-3 (Ross, 1971): A bidding card game with 3 cards, where players bid on revealed point cards. This is an imperfect information variant where players only know if they won/lost the bid, not the opponent's card.
Liar's Dice-3 (Lisy et al., 2015): A dice game where players roll a 3-sided die secretly and bid on the dice outcomes.
Small Matrix: A custom game where player 1 has five actions (rock, paper, scissors, A1, A2), and player 2 has three (rock, paper, scissors). It's a modified rock-paper-scissors where A1/A2 actions by player 1 result in large losses, simulating situations with high-loss actions.

Testing Games: Eight relatively large and challenging IIGs are used to evaluate the generalization performance of the learned discounting scheme on unseen games.

Battleship-2 (Farina et al., 2019): A board game where players secretly place a $1 \times 2$ ship on a $2 \times 2$ grid and take turns firing.
Battleship-3 (Farina et al., 2019): A larger version of Battleship with a $2 \times 3$ grid.
Goofspiel-4 (Ross, 1971): A larger version of Goofspiel with 4 cards.
Liar's Dice-4 (Lisy et al., 2015): A larger version of Liar's Dice with 4-sided dice.
Leduc Poker (Southey et al., 2005): A larger poker game with a 6-card deck (two suits, three ranks) and two betting rounds, one public card revealed after the first round.
Big Leduc Poker: A more complex version of Leduc Poker with a 24-card deck (twelve ranks) and allowing up to six raises per round.
HUNL Subgame-3 (Brown & Sandholm, 2018): A subgame of Heads-Up No-Limit Texas Hold'em (HUNL) starting at the final betting round with $500 in the pot.
HUNL Subgame-4 (Brown & Sandholm, 2018): Another HUNL subgame starting at the final betting round with $3,750 in the pot.

The choice of these games, both for training and testing, ensures a strong diversity in rules, scales, and characteristics, which is crucial for testing the generalization ability of the learned policy.

The sizes of the games used are reported in Table 4 of the original paper: The following are the results from Table 4 of the original paper:

Game	#Histories	#Infosets	#Terminal histories	Depth	Max size of infosets
Small Matrix	21	2	15	3	5
Kuhn Poker	58	12	30	6	2
Goofspiel-3	67	16	36	5	4
Liar's Dice-3	1,147	192	567	10	3
Battleship-2	10,069	3,286	5,568	9	4
Battleship-3	732,607	81,027	552,132	9	7
Goofspiel-4	1,077	162	576	7	14
Liar's Dice-4	8,181	1024	4,080	12	4
Leduc Poker	9,457	936	5,520	12	5
Big Leduc Poker	6,178,561	100,800	3,953,424	20	23
HUNL Subgame-3	398,112,843	69,184	261,126,360	10	1,980
HUNL Subgame-4	244,005,483	43,240	158,388,120	8	1,980

5.2. Evaluation Metrics

The primary evaluation metric used in the paper is exploitability.

5.2.1. Exploitability ( $e(\bar{\sigma})$ )

Conceptual Definition: Exploitability quantifies how far a given strategy profile (a set of strategies for all players) is from a Nash equilibrium. A strategy profile with zero exploitability is a Nash equilibrium, meaning no player can improve their outcome by unilaterally changing their strategy. A lower exploitability value indicates a closer approximation to a Nash equilibrium. For two-player zero-sum games, exploitability is often defined as twice the maximum utility one player can gain by playing a best response against the other player's fixed strategy. The paper defines the exploitability of a strategy profile as the average exploitability across all players. It directly measures the approximation error to the Nash equilibrium.
Mathematical Formula: The paper defines the exploitability of a strategy profile $\bar{\sigma}$ as: $ e ( \bar { \sigma } ) = \frac { 1 } { | \mathcal { N } | } \sum _ { i \in \mathcal { N } } e _ { i } ( \bar { \sigma } _ { i } ) $ where $e_i(\bar{\sigma}_i)$ is the exploitability of player $i$ 's strategy. For a single player $i$ , the exploitability of their strategy $\bar{\sigma}_i$ against the opponent's strategy $\bar{\sigma}_{-i}$ is: $ e _ { i } ( \bar { \sigma } _ { i } ) = u _ { i } ( \mathrm { B R } ( \bar { \sigma } _ { - i } ) , \bar { \sigma } _ { - i } ) - u _ { i } ( \bar { \sigma } _ { i } , \bar { \sigma } _ { - i } ) $
Symbol Explanation:
- $e(\bar{\sigma})$ : The exploitability of the strategy profile $\bar{\sigma}$ .
- $|\mathcal{N}|$ : The number of players in the game (for two-player games, $|\mathcal{N}|=2$ ).
- $e_i(\bar{\sigma}_i)$ : The exploitability of player $i$ 's strategy $\bar{\sigma}_i$ .
- $u_i(\cdot, \cdot)$ : The expected utility function for player $i$ .
- $\mathrm{BR}(\bar{\sigma}_{-i})$ : The best response strategy for player $i$ against the combined strategies of all other players $\bar{\sigma}_{-i}$ . This is the strategy that maximizes player $i$ 's expected utility given $\bar{\sigma}_{-i}$ .
- $\bar{\sigma}_i$ : Player $i$ 's current average strategy.
- $\bar{\sigma}_{-i}$ : The average strategies of all players except player $i$ .
  
  The reward function for DDCFR's training also relies on exploitability, specifically the reduction in logarithmic exploitability: $\hat{R}^G = \log E_1^G - \log E_T^G$ .

5.3. Baselines

The paper compares DDCFR against several strong existing CFR variants:

CFR+ (Tammelin, 2014): A milestone CFR variant known for its linear discounting scheme, regret resetting, and alternating updates, significantly faster than vanilla CFR.
DCFR (Brown & Sandholm, 2019b): A family of algorithms that systematically discounts cumulative regrets and strategies using fixed hyperparameters ( $\alpha=1.5, \beta=0, \gamma=2$ ), considered one of the most efficient equilibrium-finding algorithms.
Predictive CFR+ (PCFR+) (Farina et al., 2021): A state-of-the-art CFR variant based on predictive Blackwell approachability, which uses a fixed quadratic discounting scheme.
PDCFR: An algorithm that combines $PCFR+$ and DCFR, also using fixed discounting schemes.
Greedy Weights (Zhang et al., 2022): A dynamic discounted regret minimization algorithm initially designed for normal-form games, adapted for extensive-form games for comparison.

5.4. Training Details

Noise Standard Deviation ( $\delta$ ): Fixed at 0.5 for Evolution Strategies (ES).
Population Size ( $N$ ): Set to 100 for ES.
Parallelization: The evaluation of perturbed parameters is distributed across 200 CPU cores to speed up training.
Action Space Ranges:
- $\alpha_t, \gamma_t$ : Continuous values in [0, 5].
- $\beta_t$ : Continuous values in $[-5, 0]$ . These ranges are selected based on Theorem 1 for theoretical convergence guarantees.
- $\tau_t$ : Discrete values from the set $\{1, 2, 5, 10, 20\}$ , representing the number of iterations to apply the chosen discounting weights.
Policy Network Architecture ( $\pi_{\pmb{\theta}}$ ): A neural network consisting of three fully-connected layers. Each layer has 64 units and uses ELU activation functions. The network takes the current state ( $[\hat{t}, \hat{E}_{t-1}^G]$ ) as input and outputs values for $\alpha_t, \beta_t, \gamma_t$ , and a probability distribution over the discrete values of $\tau_t$ .
Optimizer: Adam optimizer with a learning rate (lr) of 0.01.
Training Epochs ( $M$ ): The agent is trained for 1000 epochs.
Training Duration: Approximately 24 hours.
CFR Iterations ( $T$ ): The total number of CFR iterations for each game within an epoch is set to 1000, which is typically sufficient for convergence to small exploitability values.
CFR Variant Techniques: All CFR variants used in the experiments (including DDCFR and baselines) incorporate the alternating-updates technique, where players update their strategies one after another.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that DDCFR consistently achieves faster convergence and lower exploitability compared to $CFR+$ and DCFR, especially on testing games, highlighting its strong generalization ability.

6.1.1. Performance on Training Games

The performance comparison on training games (Small Matrix, Kuhn Poker, Goofspiel-3, Liar's Dice-3) shows that DDCFR converges faster to a lower exploitability. This is expected as these games were part of the optimization objective.

The following are the results from Figure 2 of the original paper:

Figure 3: Comparison of DDCFR against two discounting CFR variants on eight testing games.
该图像是图表，展示了DDCFR与两个折扣CFR变体在八个测试游戏中的可利用性（exploitability）对比。图中包含不同算法在迭代次数（Iterations）变化下的表现，强调了DDCFR在速度和性能上的优势。

Figure 2: Comparison of DDCFR against two discounting CFR variants on four training games.

Analysis of Figure 2:

Small Matrix: DDCFR shows the steepest decline in exploitability, reaching a significantly lower value than $CFR+$ and DCFR by 1000 iterations.
Kuhn Poker: All three algorithms perform well, but DDCFR still achieves slightly better exploitability at the later iterations.
Goofspiel-3: DDCFR clearly outperforms $CFR+$ and DCFR, showing a faster convergence trajectory and a lower final exploitability.
Liar's Dice-3: Similar to Goofspiel-3, DDCFR demonstrates superior performance, indicating its learned scheme is effective even in more complex training environments. In summary, DDCFR successfully optimizes its performance on the training games, validating the effectiveness of the ES-based learning process.

6.1.2. Performance on Testing Games

A more crucial validation of DDCFR is its performance on unseen testing games, demonstrating its generalization ability. DDCFR maintains competitive performance and often outperforms $CFR+$ and DCFR.

The following are the results from Figure 3 of the original paper:

Figure 4: The learned discounting scheme behaves differently in various games yet exhibits a similar trend. Compared with DCFR's fixed discounting scheme (we can view DCFR as a special case of DDCFR,…
该图像是图表，展示了动态折扣CFR (DDCFR) 中不同参数 $\alpha_t$ 、 $\beta_t$ 、 $\gamma_t$ 和 $\tau_t$ 的值随迭代次数变化的趋势。各条线代表不同游戏（如“小矩阵”、“谎言骰子3”、“战舰3”、“大Leduc扑克”以及“所有游戏”）中的表现。与固定折扣方案DCFR相比，DDCFR在早期迭代时表现出更激进的折扣策略，随着迭代的进行变得更加温和。

Figure 3: Comparison of DDCFR against two discounting CFR variants on eight testing games.

Analysis of Figure 3:

Battleship-2, Battleship-3, Goofspiel-4, Liar's Dice-4, Leduc Poker, Big Leduc Poker, HUNL Subgame-3, HUNL Subgame-4: Across all eight testing games, DDCFR consistently achieves lower exploitability than $CFR+$ and DCFR in the later iterations (500-1000).
Significant Improvement: The paper states that DDCFR yields an average reduction in exploitability of $42\%$ compared to DCFR on these testing games. This is a remarkable improvement given DCFR's already strong performance.
Generalization: The consistent outperformance on unseen games strongly supports the claim that DDCFR's dynamic discounting scheme has a robust generalization ability. This is attributed to the careful design of the training games, the game-agnostic state space, and the scalable ES training algorithm.

6.1.3. Overall Iteration Performance

For completeness, the paper includes figures showing all iterations (1 to 1000) for training and testing games in the appendices.

The following are the results from Figure 5 of the original paper:

Figure 6: All iterations of DDCFR and discounting CFR variants on eight testing games.
该图像是一个展示不同CFR算法在八个测试游戏上的可利用性（Exploitability）变化的图表。图中展示了CFR+、DCFR和DDCFR的迭代过程，横轴为迭代次数，纵轴为可利用性。DDCFR的动态折扣方案显著改善了收敛速度。

Figure 5: All iterations of DDCFR and discounting CFR variants on four training games. This figure provides the full exploitability curves for the training games, confirming that the performance differences are visible across the entire iteration range, not just the later stages.

The following are the results from Figure 6 of the original paper:

Figure 7: Performance comparison between DDCFR and GreedyWeights on extensive-form games.
该图像是图表，展示了DDCFR与GreedyWeights在四种扩展形式游戏（Battleship-2, Liar's Dice-4, Kuhn Poker, Leduc Poker）中的表现对比。横坐标为迭代次数，纵坐标为可利用性（Exploitability）。可以看到DDCFR在迭代过程中可利用性逐渐降低，表现优于GreedyWeights。

Figure 6: All iterations of DDCFR and discounting CFR variants on eight testing games. Similarly, this figure shows the full exploitability curves for the testing games, further reinforcing DDCFR's superior convergence behavior.

6.1.4. Comparison with Greedy Weights

The paper also compares DDCFR with Greedy Weights, an algorithm primarily designed for normal-form games.

The following are the results from Figure 7 of the original paper:

$Figure 8: Performance comparison between DDCFR and $\\mathrm { P C F R + }$ on four testing games.$
该图像是图表，展示了DDCFR与PCFR+在四个测试游戏（Leduc Poker、Big Leduc Poker、Goofspiel-4和Liar's Dice-4）中的表现比较。图中显示了随着迭代次数的增加，两个算法的可利用性（Exploitability）变化趋势。

Figure 7: Performance comparison between DDCFR and GreedyWeights on extensive-form games.

Analysis of Figure 7:

DDCFR consistently converges faster to a lower exploitability across all four extensive-form games (Battleship-2, Liar's Dice-4, Kuhn Poker, Leduc Poker).
This suggests that while Greedy Weights is promising for normal-form games, its approach might not be as effective in the more complex extensive-form games with a large number of information sets, where DDCFR's explicit design for this domain yields better results.

6.1.5. Comparison and Combination with Predictive CFR+ (PCFR+)

The DDCFR framework is designed to be general and can be applied to other CFR variants. The paper illustrates this by applying it to $PCFR+$ .

The following are the results from Figure 8 of the original paper:

$Figure 9: Performance comparison between DPCFR $^ +$ and $\\mathrm { P C F R + }$ on four testing games.$
该图像是一个图表，展示了DPCFR $^+$ 和 PCFR+ 在四个测试游戏中的性能比较。图中标示了每次迭代的可利用性（Exploitability）变化，数据点分别来自Leduc Poker、Big Leduc Poker、Goofspiel-4 和 Liar's Dice-4。

Figure 8: Performance comparison between DDCFR and $\\mathrm { P C F R + }$ on four testing games.

Analysis of Figure 8:

$PCFR+$ is faster on non-poker games (Goofspiel-4, Liar's Dice-4), while DDCFR (built on DCFR) outperforms $PCFR+$ on poker games (Leduc Poker, Big Leduc Poker). This aligns with findings from the $PCFR+$ paper itself.
However, DDCFR significantly narrows the gap with $PCFR+$ on non-poker games and further amplifies its advantage on poker games, showing the benefit of its dynamic scheme.

6.1.6. Dynamic Predictive CFR+ (DPCFR+)

Applying the DDCFR framework to $PCFR+$ (which uses a fixed quadratic discounting scheme) results in Dynamic Predictive CFR+ (DPCFR+).

The following are the results from Figure 9 of the original paper:

Figure 10: Performance comparison between PDDCFR and PDCFR on four testing games.
该图像是一个比较图表，展示了在四个测试游戏（Leduc Poker、Big Leduc Poker、Goofspiel-4、Liar's Dice-4）中，PDDCFR与PDCFR算法在迭代过程中的可利用性（Exploitability）变化情况。图中可以看到，PDDCFR在多个迭代次数下表现出更低的可利用性。

Figure 9: Performance comparison between DPCFR $^ +$ and $\\mathrm { P C F R + }$ on four testing games.

Analysis of Figure 9:

$DPCFR+$ further unlocks the potential of $PCFR+$ on non-poker games (Goofspiel-4, Liar's Dice-4), achieving lower exploitability.
Crucially, it does so without sacrificing its performance on poker games (Leduc Poker, Big Leduc Poker). This demonstrates the plug-and-play nature and general applicability of the dynamic discounting framework as a performance booster.

6.1.7. Predictive Dynamic Discounted CFR (PDDCFR)

The framework is also applied to PDCFR (a combination of $PCFR+$ and DCFR), resulting in Predictive Dynamic Discounted CFR (PDDCFR).

The following are the results from Figure 10 of the original paper:

Figure 2: Comparison of DDCFR against two discounting CFR variants on four training games.
该图像是一个图表，展示了DDCFR与两种折扣CFR变体在四个训练游戏中的可剥离性表现。图中比较了CFR+、DCFR和DDCFR在不同迭代次数下的结果，包括小矩阵、Kuhn扑克牌、Goofspiel-3和Liar's Dice-3。

Figure 10: Performance comparison between PDDCFR and PDCFR on four testing games.

Analysis of Figure 10:

PDDCFR significantly enhances PDCFR's performance across both non-poker and poker games (Leduc Poker, Big Leduc Poker, Goofspiel-4, Liar's Dice-4).
This again underscores the generality of the dynamic discounting framework to improve various CFR base algorithms.

6.1.8. Visualization of Learned Discounting Scheme

To understand the behavior of the learned policy $\pi_{\pmb{\theta}}$ , its actions ( $\alpha_t, \beta_t, \gamma_t, \tau_t$ ) are visualized during the iteration process.

The following are the results from Figure 4 of the original paper:

Figure 5: All iterations of DDCFR and discounting CFR variants on four training games.
该图像是一个图表，展示了在四个训练游戏上，DCFR和不同折扣CFR变体的所有迭代结果。图中横轴为迭代次数，纵轴为可利用性，分别显示了CFR+、DCFR和DDCFR在小矩阵、Kuhn扑克、Goofspiel-3和Liar's Dice-3中的性能表现。

Figure 4: The learned discounting scheme behaves differently in various games yet exhibits a similar trend. Compared with DCFR's fixed discounting scheme (we can view DCFR as a special case of DDCFR, where $\\alpha _ { 1 } = 1 . 5 , \\beta _ { 1 } = 0 , \\gamma _ { 1 } = 2$ , and the duration $\\tau _ { 1 } = \\infty$ ), it is more aggressive in the earlier iterations and becomes more moderate as the iteration progresses.

Analysis of Figure 4:

Dynamic Adjustment: The actions ( $\alpha_t, \beta_t, \gamma_t, \tau_t$ ) vary across different games and change dynamically over iterations, confirming the adaptive nature of the learned scheme.
General Trends: Despite game-specific variations, a similar overall trend is observed:
- $\alpha_t$ (positive regret discount) generally increases with iterations. Smaller $\alpha_t$ in earlier iterations indicates more aggressive discounting of positive regrets. As iterations progress, $\alpha_t$ increases, making the discounting more moderate.
- $\beta_t$ (negative regret discount) generally decreases with iterations. The scheme consistently uses more aggressive strategies when discounting negative cumulative regrets (smaller $\beta_t$ means stronger discounting). This is beneficial for rapidly reusing promising actions.
- $\gamma_t$ (average strategy discount) generally decreases with iterations. A larger $\gamma_t$ in earlier iterations signifies more aggressive discounting of the cumulative strategy, giving more weight to recent strategies. As iterations progress, $\gamma_t$ decreases, leading to more moderate averaging.
- $\tau_t$ (duration of weights) suggests that discounting weights should be adjusted more frequently in earlier iterations (smaller $\tau_t$ ), and less frequently later on (larger $\tau_t$ ).
Aggressive Early, Moderate Late: Compared to DCFR's fixed scheme ( $\alpha=1.5, \beta=0, \gamma=2$ ), the learned scheme is generally more aggressive in earlier iterations (smaller $\alpha$ , larger $\gamma$ ). This might be because optimal strategies change frequently in early iterations, benefiting from heavily discounting older information. As learning progresses and strategies stabilize, the scheme becomes more moderate, aiding convergence.

6.2. Ablation Studies / Parameter Analysis

Ablation studies were conducted to assess the effectiveness of DDCFR's various components. Results are reported in Table 1, showing the exploitability on three testing games (Leduc Poker, HUNL Subgame-3, HUNL Subgame-4).

The following are the results from Table 1 of the original paper:

	DDCFR-1	DDCFR-2	DDCFR-3	DDCFR-it	DDCFR-ex	DDCFR-dis	DDCFR-rl	DDCFR
LeducPoker	1.566e-4	1.490e-4	1.213e-4	1.249e-4	1.289e-4	1.156e-4	1.229e-4	1.198e-4
Subgame-3	2.533e-3	2.481e-3	2.393e-3	2.292e-3	5.320e-3	5.795e-3	4.987e-3	1.963e-3
Subgame-4	2.517e-3	2.175e-3	2.122e-3	2.714e-3	3.804e-3	5.803e-3	5.198e-3	2.005e-3

Analysis of Table 1:

Number of Training Games (DDCFR-x):
- DDCFR-1 (trained on one game) performs relatively poorly, indicating overfitting.
- DDCFR-2 and DDCFR-3 show improved results, as training on multiple games necessitates balancing performance and enhances generalization.
- The full DDCFR (trained on four games) achieves the best performance. Notably, the addition of "Small Matrix" to the training set significantly improves performance on "HUNL Subgame-3" and "HUNL Subgame-4," possibly because both games involve high-loss actions, a property captured by Small Matrix. This confirms the importance of diverse training games for generalization.
State Representation (DDCFR-it, DDCFR-ex):
- DDCFR-it (using only normalized iteration $\hat{t}$ ) performs better than DDCFR-ex (using only normalized exploitability $\hat{E}_{t-1}^G$ ). This suggests that the iteration number provides more critical decision-making information to the agent than exploitability alone.
- However, both DDCFR-it and DDCFR-ex perform worse than the full DDCFR, indicating that normalized iteration and normalized exploitability are complementary features, and using both together provides the best performance.
Action Space Design (DDCFR-dis):
- DDCFR-dis uses a discrete action space for $\alpha, \beta, \gamma$ (discretizing their ranges to integers), while DDCFR uses a continuous action space.
- DDCFR-dis performs significantly worse than DDCFR. This indicates that continuous actions enable the learned dynamic discounting scheme to exercise more fine-grained control over the discounting weights, leading to better performance.
Optimization Algorithm (DDCFR-rl):
- DDCFR-rl uses PPO (Proximal Policy Optimization), a common Reinforcement Learning (RL) algorithm, to optimize the policy.
- DDCFR-rl underperforms DDCFR (which uses ES) across all testing games. This validates the authors' choice of ES due to the challenges of sparse rewards, long time horizons, and multi-task learning, which typical RL algorithms like PPO struggle with in this context.

6.2.1. Running Time Comparison

The paper provides a comparison of running times between DCFR and DDCFR to quantify the computational overhead introduced by the dynamic discounting.

The following are the results from Table 2 of the original paper:

Game	DCFR_time	DDCFR_time	(DDCFR_time -DCFR_time) / DCFR_time
Battleship-2	0:14:54.558	0:14:55.814	0.140%
Battleship-3	11:10:46.895	11:12:05.156	0.194%
Goofspiel-4	00:01:03.259	00:01:03.518	0.409%
Liar's Dice-4	0:06:57.896	0:06:58.577	0.163%
Leduc Poker	0:08:13.140	0:08:13.921	0.158%
Big Leduc Poker	14:55:48.347	14:58:07.357	0.259%
HUNL Subgame-3	0:02:54.634	0:02:55.024	0.223%
HUNL Subgame-4	0:02:14.379	0:02:14.648	0.200%

Analysis of Table 2:

The table shows that DDCFR introduces a very small increase in running time compared to DCFR. The average increase is only $0.22\%$ .
This negligible overhead confirms the claim that feature calculations (exploitability) and network inference for the policy network add minimal burden to the overall CFR iteration time, especially with parallel execution of exploitability calculation. The bulk of the computational cost remains within the CFR tree traversal itself.

6.2.2. Logarithmic Normalization for Exploitability

The paper justifies the use of logarithmic normalization for exploitability in the state representation.

The following are the results from Table 3 of the original paper:

t	1	10	20	100	500	1000
Et	4.583e-1	3.302e-2	1.599e-2	1.767e-3	1.699e-4	4.021e-5
(log Et − log Emin)/(log E1 − log Emin)	1.000	0.902	0.875	0.793	0.706	0.652
(√Et − √Emin)/(√E1 − √Emin)	1.000	0.268	0.187	0.062	0.019	0.009
(Et − Emin)/(E1 − Emin)	1.000	0.072	0.034	0.004	0.000	0.000

Analysis of Table 3:

Raw Exploitability ( $E_t$ ): The raw exploitability values decrease very rapidly initially and then slow down significantly. A linear scale would compress most of the meaningful changes into the very early iterations.
Logarithmic Normalization: (log Et − log Emin)/(log E1 − log Emin) results in a more evenly distributed range of values (e.g., from 1.000 down to 0.652). This provides the neural network agent with more discriminative information across different stages of CFR convergence.
Other Normalizations: Square root normalization $(√Et − √Emin)/(√E1 − √Emin)$ and linear normalization (Et − Emin)/(E1 − Emin) compress values much more drastically, especially in later iterations, where they quickly approach zero. This would make it harder for the agent to differentiate between states. The logarithmic normalization effectively spreads out the exploitability values, making it easier for the policy network to learn meaningful distinctions and make better decisions throughout the CFR process.

6.3. Deterministic Nature of DDCFR's Iteration Process

The paper clarifies that once a discounting policy is learned using ES (which is a stochastic training process), the subsequent application of DDCFR using this learned policy is entirely deterministic. When testing the learned policy on a game, DDCFR always initializes with a uniform random average strategy, and the policy network produces consistent actions for determined inputs. Therefore, each run of DDCFR with the same learned model on the same game will generate an identical exploitability curve.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Dynamic Discounted CFR (DDCFR), a pioneering framework for solving imperfect-information games by employing an automatically-learned, dynamic discounting scheme. By formalizing the CFR iteration process as a Markov Decision Process (MDP) and framing the discounting scheme learning as a policy optimization problem, DDCFR moves beyond the limitations of fixed, manually-specified discounting rules in previous CFR variants. The use of Evolution Strategies (ES) for training the policy, combined with a game-agnostic state representation, enables the learned scheme to dynamically adapt discounting weights at runtime. Extensive experiments confirm that DDCFR achieves faster convergence and improved performance, demonstrating strong generalization ability across both training and unseen testing games, including complex ones like HUNL subgames. The framework's general applicability is further showcased by its ability to enhance other state-of-the-art CFR variants like $PCFR+$ and PDCFR.

7.2. Limitations & Future Work

The paper does not explicitly list a "Limitations" section, but some can be inferred:

Computational Cost of Training: While the inference time for the learned policy is negligible, the ES-based training process itself, requiring evaluation across multiple games and numerous perturbed parameters for many epochs, can be computationally intensive (e.g., 24 hours on 200 CPU cores). However, the authors argue this cost is amortized over the reusability of the trained policy across many games.
Scope: The current work focuses on two-player zero-sum IIGs. Extending it to multiplayer or general-sum games could introduce new complexities in reward definition, state representation, and policy learning.
Exploitability Calculation Overhead: While the paper states exploitability calculation is parallelized to avoid wall-clock time overhead during CFR execution, the exploitability calculation itself (which requires best response computation) is often one of the most expensive parts of evaluating CFR algorithms, especially in large games. This cost is inherent to the reward function and state representation.

Potential future research directions implied or suggested by the paper include:
Extending the Framework: Applying the dynamic discounting framework to a wider range of CFR variants or even other iterative game-solving algorithms.
More Complex State Representations: Exploring richer state features that might capture more nuanced aspects of the CFR iteration process or game characteristics, potentially leading to even more sophisticated adaptive policies.
Online Learning: Investigating methods for online adaptation or fine-tuning of the discounting policy without requiring full ES retraining.
Multiplayer/General-Sum Games: Adapting the DDCFR framework to non-zero-sum or multiplayer IIGs, which present distinct challenges in defining equilibrium and regret.

7.3. Personal Insights & Critique

This paper presents a highly insightful and impactful contribution to the field of imperfect-information game solving. The idea of dynamically learning the hyperparameters of CFR algorithms, rather than relying on fixed, manually-tuned values, is a logical and powerful extension of previous work on discounted CFR variants.

Inspirations:

Meta-Learning for Algorithm Design: The approach effectively acts as a meta-learning framework that learns to "tune" the CFR algorithm itself. This concept could be transferred to optimizing hyperparameters or algorithmic choices in other iterative optimization or learning algorithms, especially those with long execution times and complex dynamics.
MDP Formulation for Abstract Processes: Framing the CFR iteration as an MDP is elegant. This shows how complex, non-standard optimization processes can be effectively modeled for RL or ES optimization, even when direct gradient computation is difficult.
Effectiveness of ES for Challenging RL Problems: The successful application of Evolution Strategies to a problem with sparse rewards, long horizons, and multi-task requirements reinforces ES as a strong alternative to gradient-based RL in such scenarios. This could inspire more researchers to consider ES for similar meta-learning or hyperparameter optimization tasks.
Generalizability as a Key Metric: The strong emphasis on generalization and the design choices (game-agnostic state, diverse training set) to achieve it are commendable. This highlights that for algorithms intended for broad application, learning an adaptive policy that generalizes is often more valuable than achieving peak performance on a single, manually-tuned instance.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

Interpretability of Policy: While Figure 4 provides some insight into the learned trends of $\alpha_t, \beta_t, \gamma_t$ , the exact "logic" or complex interplay of these parameters as a function of both iteration and exploitability could be hard to fully interpret. Understanding why the policy makes certain choices might require more advanced interpretability techniques.
Sensitivity to Training Game Diversity: The ablation study showed that the number and type of training games significantly impact generalization. This implies that carefully curating a diverse and representative set of training games is crucial for DDCFR's success, which might still involve some level of manual effort or domain expertise. The "Small Matrix" game, designed to simulate high-loss actions, is a good example of how targeted training examples can enhance generalization for specific game characteristics.
Scalability for Extremely Large Games: While the running time overhead is small, and ES is parallelizable, the base CFR algorithm itself (which requires game tree traversals) can become prohibitively expensive for truly enormous games (e.g., full-scale no-limit Texas Hold'em). DDCFR improves the convergence rate, but the fundamental scalability limits of tree-based CFR still apply.
Alternative Reward Functions: The current reward function focuses on overall exploitability reduction. Exploring more nuanced reward functions that might provide denser feedback or target specific convergence behaviors (e.g., stability, avoiding oscillations) could be a fruitful area for future work.
Exploring Dynamic $\tau_t$ Further: The discrete choices for $\tau_t$ (1, 2, 5, 10, 20) are relatively limited. A continuous $\tau_t$ (with appropriate regularization) or a more sophisticated adaptive mechanism for deciding when to update the weights could yield further improvements, though it might increase the complexity of the policy and training.

Overall, DDCFR represents a significant step towards more autonomous and efficient game-solving AI, demonstrating the power of meta-learning in optimizing foundational algorithms.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.