Dynamic Discounted Counterfactual Regret Minimization
TL;DR Summary
This paper introduces Dynamic Discounted CFR (DDCFR), the first framework to dynamically learn the discounting scheme for CFR iterations. By formalizing the CFR iteration process as a Markov decision process, DDCFR achieves faster convergence and improved performance, significantly boosting efficiency in solving imperfect-information games.
Abstract
Hang Xu, Kai Li, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng

Counterfactual regret minimization (CFR) is a family of iterative algorithms showing promising results in solving imperfect-information games. Recent novel CFR variants (e.g., CFR+, DCFR) have significantly improved the convergence rate of the vanilla CFR. The key to these CFR variants' performance is weighting each iteration non-uniformly, i.e., discounting earlier iterations. However, these algorithms use a fixed, manually-specified scheme to weight each iteration, which enormously limits their potential. In this work, we propose Dynamic Discounted CFR (DDCFR), the first equilibrium-finding framework that discounts prior iterations using a dynamic, automatically-learned scheme. We formalize CFR's iteration process as a carefully designed Markov decision process and transform the discounting scheme learning problem into a policy optimization problem within it. The learned discounting scheme dynamically weights each iteration on the fly using information available at runtime. Extensive experimental results demonstrate that DDCFR's dynamic discounting scheme has a strong generalization ability and leads to faster convergence with improved performance. The code is available at https://github.com/rpSebastian/DDCFR.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Dynamic Discounted Counterfactual Regret Minimization (DDCFR)," which introduces a novel framework for learning dynamic discounting schemes in Counterfactual Regret Minimization (CFR) algorithms, aimed at solving imperfect-information games more efficiently.
1.2. Authors
The authors are:
- Hang Xu (Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
- Kai Li (Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences)
- Haobo Fu (Tencent AI Lab)
- Qiang Fu (Tencent AI Lab)
- Junliang Xing (Tsinghua University)
- Jian Cheng (Institute of Automation, Chinese Academy of Sciences; School of Future Technology, University of Chinese Academy of Sciences; AiRiA)
1.3. Journal/Conference
The paper was published in the proceedings of a top-tier Artificial Intelligence conference, likely the AAAI Conference on Artificial Intelligence or a similar venue, given the nature of the work and citations to other AAAI papers. Such venues are highly reputable and influential in the AI research community, particularly for topics in game theory, multi-agent systems, and reinforcement learning.
1.4. Publication Year
The publication year is not explicitly stated in the provided text snippet, but based on the references (e.g., Xu et al., 2022 is cited), it is likely 2023 or later.
1.5. Abstract
Counterfactual Regret Minimization (CFR) algorithms are iterative methods widely used to solve imperfect-information games. While recent CFR variants like CFR+ and DCFR have significantly improved convergence rates by non-uniformly weighting (discounting) earlier iterations, they rely on fixed, manually-specified discounting schemes. This limitation restricts their full potential.
This paper proposes Dynamic Discounted CFR (DDCFR), a novel framework that learns a dynamic, automatically-adjusted discounting scheme. The core methodology involves formalizing the CFR iteration process as a Markov Decision Process (MDP) and transforming the discounting scheme learning into a policy optimization problem within this MDP. The learned scheme dynamically adjusts iteration weights at runtime using available information. Extensive experiments demonstrate that DDCFR's dynamic discounting scheme offers strong generalization capabilities, leading to faster convergence and improved performance across various games, including those not seen during training.
1.6. Original Source Link
The original source link is /files/papers/6913f6f32db53125aacc49fd/paper.pdf. This appears to be a direct link to the PDF file, likely from a conference or institutional repository, indicating it is an officially published paper or a publicly available preprint.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by the paper is the inefficiency and suboptimality of fixed, manually-specified discounting schemes in existing Counterfactual Regret Minimization (CFR) variants for solving imperfect-information games.
This problem is important because imperfect-information games (IIGs) model strategic interactions where players lack complete knowledge of the game state or opponents' private information, a situation that is ubiquitous in real-world decision-making. Solving these games is theoretically and practically crucial for developing robust artificial intelligence. CFR and its variants are among the most successful approaches for computing Nash equilibria in IIGs, underpinning breakthroughs in games like poker. However, current CFR variants, despite their advancements (e.g., CFR+, DCFR), rely on static discounting rules (e.g., DCFR's fixed hyperparameters).
The specific challenges or gaps in prior research include:
- Suboptimality: A fixed, manually chosen discounting scheme cannot be optimal for all games, or even across different phases of the same game, since the space of possible schemes is effectively limitless.
- Lack of Flexibility: Pre-determined schemes cannot adapt to runtime information or changing conditions during the iterative learning process, potentially hindering convergence.
- Scalability: Manually tuning hyperparameters for diverse IIGs is impractical and resource-intensive.

The paper's entry point and innovative idea is to introduce a dynamic, automatically-learned discounting scheme for CFR algorithms. Instead of relying on human intuition or extensive grid searches, DDCFR transforms the problem of finding optimal discounting weights into a policy optimization problem within a Markov Decision Process (MDP), allowing the weights to be adjusted on the fly based on the current state of the iteration process.
2.2. Main Contributions / Findings
The paper makes three primary novel contributions:
- Proposed DDCFR Framework: It introduces DDCFR, the first equilibrium-finding framework that utilizes a dynamic, automatically-learned scheme to discount prior iterations. This moves beyond fixed, manually-specified discounting rules, offering greater flexibility and potential for improved performance.
- MDP Formalization for Discounting Scheme Learning: The paper formalizes the iterative process of CFR as a carefully designed Markov Decision Process (MDP). This transformation allows the problem of learning an effective discounting scheme to be framed as a policy optimization task within the MDP, where an agent observes the state of the CFR process and outputs dynamic discounting weights.
- Effective Training Algorithm and Generalization: It designs an effective training algorithm based on Evolution Strategies (ES) to optimize the discounting policy. The learned scheme demonstrates strong generalization ability, achieving competitive performance on new games not seen during training and significantly outperforming fixed discounting schemes on both training and testing games.

The key conclusion is that dynamic, automatically-learned discounting schemes, which adapt to runtime information, lead to faster convergence and improved performance in solving imperfect-information games compared to static, manually-specified schemes. This approach provides a performance booster that can be applied to various existing CFR variants.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the DDCFR framework, it is crucial to grasp several foundational concepts in game theory and artificial intelligence:
3.1.1. Imperfect-Information Games (IIGs)
An imperfect-information game (IIG) is a type of game where players do not have complete knowledge about the state of the game or the actions taken by other players. This contrasts with perfect-information games (like chess) where all players know everything that has occurred. In IIGs, players often have private information (e.g., hidden cards in poker) and must make decisions under uncertainty. IIGs model many real-world scenarios, making their study highly relevant.
3.1.2. Two-Player Zero-Sum Games
A two-player zero-sum game is a type of game involving two players where one player's gain directly corresponds to the other player's loss, meaning their utilities sum to zero (or a constant). In such games, there is a direct conflict of interest, and finding optimal strategies often involves reaching a balance where neither player can unilaterally improve their outcome. The paper focuses on solving two-player zero-sum IIGs.
3.1.3. Nash Equilibrium
A Nash equilibrium (Nash, 1950) is a stable state in a game where no player can improve their outcome by unilaterally changing their strategy, assuming the other players' strategies remain unchanged. It represents a set of strategies, one for each player, such that each player's strategy is an optimal response to the strategies of the others. The goal of solving IIGs is often to find an (approximate) Nash equilibrium.
3.1.4. Exploitability
Exploitability is a measure of how far a given strategy profile is from a Nash equilibrium. For a strategy profile $\bar{\sigma}$, the exploitability quantifies the maximum expected utility gain an opponent could achieve by switching to their best response, assuming the other players' strategies remain fixed. In a Nash equilibrium, the exploitability is zero. An $\epsilon$-Nash equilibrium is a strategy profile in which no player's exploitability exceeds $\epsilon$. The paper uses exploitability as the primary metric for convergence and performance, defining the exploitability of a strategy profile $\bar{\sigma}$ as:
$
e(\bar{\sigma}) = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} e_i(\bar{\sigma}_i)
$
where $e_i(\bar{\sigma}_i)$ is the exploitability of player $i$'s strategy. More specifically, for player $i$, the exploitability of their strategy $\bar{\sigma}_i$ against the other players' strategies $\bar{\sigma}_{-i}$ is the difference between the maximum utility player $i$ could obtain by playing a best response and the utility they obtain from playing $\bar{\sigma}_i$:
$
e_i(\bar{\sigma}_i) = u_i(\mathrm{BR}(\bar{\sigma}_{-i}), \bar{\sigma}_{-i}) - u_i(\bar{\sigma}_i, \bar{\sigma}_{-i})
$
where $\mathrm{BR}(\bar{\sigma}_{-i})$ is the best response strategy for player $i$ given $\bar{\sigma}_{-i}$.
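To make the definition concrete, here is a minimal Python sketch that averages per-player exploitabilities; it assumes the best-response utilities and profile utilities have already been computed by a game-specific solver, and the helper name and its inputs are illustrative rather than part of the paper:

```python
# Minimal sketch: average exploitability of a strategy profile, assuming the
# best-response utilities u_i(BR(sigma_-i), sigma_-i) and the profile utilities
# u_i(sigma_i, sigma_-i) were already computed by a game-specific solver.
def exploitability(br_utils, profile_utils):
    """br_utils[i]: utility of player i's best response against the others;
    profile_utils[i]: utility of player i under the profile itself."""
    per_player = [br - u for br, u in zip(br_utils, profile_utils)]
    return sum(per_player) / len(per_player)

# Example: if each player could gain 0.05 by best-responding, the profile
# is a 0.05-Nash equilibrium.
print(exploitability([0.05, 0.05], [0.0, 0.0]))  # 0.05
```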
3.1.5. Extensive-Form Games
Extensive-form games (Osborne & Rubinstein, 1994) are a tree-based formalism used to describe IIGs.
- Game Tree: Represents all possible sequences of actions.
- Histories ($h$): Sequences of actions taken from the start of the game. Terminal histories ($z$) are those where no further actions are available.
- Information Sets ($I$): For a player $i$, an information set is a set of histories that player $i$ cannot distinguish between. If player $i$ is to move at any history in $I$, they cannot tell which specific history in $I$ they are at. Therefore, their strategy must be the same for all histories within that information set.
- Strategy ($\sigma_i(I)$): A probability distribution over the available actions for player $i$ at an information set $I$; $\sigma_i(I, a)$ is the probability of taking action $a$.
- Utility Function ($u_i(z)$): Assigns a payoff to player $i$ for each terminal history $z$.
- Reach Probabilities ($\pi^{\sigma}(h)$): The joint probability of reaching a history $h$ if all players play according to strategy profile $\sigma$. This can be decomposed into the contribution of player $i$ ($\pi_i^{\sigma}(h)$) and that of all other players ($\pi_{-i}^{\sigma}(h)$). Similarly, the information set reach probability $\pi^{\sigma}(I)$ is the sum of the reach probabilities of the histories in $I$.
- Counterfactual Value ($v_i^{\sigma}(I)$): The expected utility for player $i$ at an information set $I$, assuming player $i$ tries to reach it and the other players play according to $\sigma$. The counterfactual value of taking a specific action $a$ in $I$ is $v_i^{\sigma}(I, a)$.
3.1.6. Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker (agent). An MDP is defined by:
- States ($S$): The possible situations an agent can be in.
- Actions ($A$): The choices available to the agent in each state.
- Transition Function ($P$): A probability distribution over the next state given the current state and action. In the context of DDCFR, this describes how the CFR iteration process evolves.
- Reward Function ($R$): A value received by the agent after taking an action, indicating the desirability of that action's outcome. The goal of the agent is to maximize its cumulative reward.
3.1.7. Evolution Strategies (ES)
Evolution Strategies (ES) (Wierstra et al., 2014; Salimans et al., 2017) are a class of black-box optimization algorithms inspired by biological evolution. They are used to optimize objective functions without requiring gradient information. Instead, ES works by:
- Sampling: Generating a population of perturbed versions of the model parameters (e.g., by adding Gaussian noise).
- Evaluation: Evaluating the objective function (fitness) for each perturbed parameter set.
- Update: Updating the main model parameters based on the fitness of the perturbed versions, typically by moving towards the direction of higher fitness.
ES is known for its scalability, ease of parallelization, and robustness to sparse or noisy rewards, making it suitable for problems where traditional gradient-based methods struggle.
3.2. Previous Works
The paper builds upon and differentiates itself from several key CFR variants and related optimization techniques:
3.2.1. Counterfactual Regret Minimization (CFR)
Introduced by Zinkevich et al. (2007), CFR is a fundamental iterative algorithm for finding Nash equilibria in extensive-form IIGs. It works by iteratively traversing the game tree, calculating instantaneous regrets for not choosing certain actions, accumulating these regrets, and then updating strategies using regret-matching (Hart & Mas-Colell, 2000; Cesa-Bianchi & Lugosi, 2006). The time-averaged strategy profile converges to a Nash equilibrium.
- Instantaneous Regret ($r_i^t(I, a)$): The difference between the counterfactual value of taking action $a$ in information set $I$ and the counterfactual value of playing the current strategy in $I$: $ r_i^t(I, a) = v_i^{\sigma^t}(I, a) - v_i^{\sigma^t}(I) $, where $v_i^{\sigma^t}(I, a)$ is the counterfactual value of action $a$ in information set $I$ under strategy $\sigma^t$, and $v_i^{\sigma^t}(I)$ is the counterfactual value of information set $I$ under $\sigma^t$.
- Cumulative Regret ($R_i^t(I, a)$): The sum of instantaneous regrets for an action in an information set over all iterations up to $t$: $ R_i^t(I, a) = \sum_{k=1}^{t} r_i^k(I, a) $.
- Regret-Matching: The strategy for the next iteration is computed from the positive cumulative regrets (a short code sketch follows this list): $ \sigma_i^{t+1}(I, a) = \begin{cases} \frac{R_i^{t,+}(I, a)}{\sum_{a' \in A(I)} R_i^{t,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R_i^{t,+}(I, a') > 0 \\ \frac{1}{|A(I)|}, & \text{otherwise}, \end{cases} $ where $R_i^{t,+}(I, a) = \max(R_i^t(I, a), 0)$. This means only actions with positive regret (i.e., actions that would have been better to take) contribute to the future strategy.
- Average Strategy ($\bar{\sigma}_i^t$): The overall strategy that converges to a Nash equilibrium is the average of the strategies from all iterations, weighted by their reach probabilities: $ C_i^t(I, a) = \sum_{k=1}^{t} \pi_i^{\sigma^k}(I)\, \sigma_i^k(I, a), \quad \bar{\sigma}_i^t(I, a) = \frac{C_i^t(I, a)}{\sum_{a' \in A(I)} C_i^t(I, a')}, $ where $\pi_i^{\sigma^k}(I)$ is the information set reach probability for player $i$ under strategy $\sigma^k$.
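The regret-matching rule above is straightforward to implement at a single information set. Below is a minimal NumPy sketch (not the paper's code) of that rule:

```python
import numpy as np

# Minimal sketch of regret matching at one information set: actions with
# positive cumulative regret are played in proportion to that regret; if no
# regret is positive, play uniformly at random.
def regret_matching(cumulative_regret):
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

print(regret_matching(np.array([2.0, -1.0, 1.0])))   # [2/3, 0, 1/3]
print(regret_matching(np.array([-1.0, -2.0, -0.5]))) # uniform distribution
```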
3.2.2. CFR+
CFR+ (Tammelin, 2014) significantly improved vanilla CFR's convergence rate. Its key modifications include:
- Linear Discounting: Weights iteration $t$ proportionally to $t$ rather than uniformly, giving more importance to recent iterations.
- Regret Resetting: Sets any action with negative cumulative regret to zero on each iteration, allowing for immediate reuse of promising actions.
- Alternating Updates: Players update their strategies in an alternating fashion.
3.2.3. Discounted CFR (DCFR)
DCFR (Brown & Sandholm, 2019b) is a family of algorithms that generalizes discounting schemes, further accelerating convergence. It discounts both the cumulative regrets and the cumulative strategy using three hyperparameters: $\alpha$, $\beta$, and $\gamma$.
- Discounted Cumulative Regret Update: $ R_i^t(I, a) = \begin{cases} R_i^{t-1}(I, a)\, \frac{(t-1)^{\alpha}}{(t-1)^{\alpha}+1} + r_i^t(I, a), & \text{if } R_i^{t-1}(I, a) > 0 \\ R_i^{t-1}(I, a)\, \frac{(t-1)^{\beta}}{(t-1)^{\beta}+1} + r_i^t(I, a), & \text{otherwise}, \end{cases} $ Here, positive cumulative regrets are multiplied by $\frac{(t-1)^{\alpha}}{(t-1)^{\alpha}+1}$ and negative cumulative regrets by $\frac{(t-1)^{\beta}}{(t-1)^{\beta}+1}$ before adding the current instantaneous regret $r_i^t(I, a)$.
- Discounted Cumulative Strategy Update:
$
C _ { i } ^ { t } ( I , a ) = C _ { i } ^ { t - 1 } ( I , a ) \left( \frac { t - 1 } { t } \right) ^ { \gamma } + \pi _ { i } ^ { \sigma ^ { t } } ( I ) \sigma _ { i } ^ { t } ( I , a ) .
$
The previous cumulative strategy is discounted by $\left(\frac{t-1}{t}\right)^{\gamma}$ before adding the current strategy contribution.
DCFR empirically performs best with the fixed values $\alpha = 1.5$, $\beta = 0$, and $\gamma = 2$ (a code sketch of this update follows).
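As an illustration of DCFR's fixed scheme, the sketch below applies the discounted regret and strategy updates at one information set with the recommended values $\alpha = 1.5$, $\beta = 0$, $\gamma = 2$; the function signature and argument names are illustrative, not the paper's implementation:

```python
import numpy as np

# Minimal sketch of DCFR's discounted updates at a single information set,
# using the fixed hyperparameters alpha=1.5, beta=0, gamma=2.
def dcfr_update(R, C, inst_regret, reach_weighted_strategy, t,
                alpha=1.5, beta=0.0, gamma=2.0):
    # Discount positive and negative cumulative regrets with different exponents.
    pos_w = (t - 1) ** alpha / ((t - 1) ** alpha + 1) if t > 1 else 0.0
    neg_w = (t - 1) ** beta / ((t - 1) ** beta + 1) if t > 1 else 0.0
    R = np.where(R > 0, R * pos_w, R * neg_w) + inst_regret
    # Discount the cumulative strategy before adding this iteration's
    # reach-weighted strategy contribution pi_i(I) * sigma_i(I, a).
    C = C * ((t - 1) / t) ** gamma + reach_weighted_strategy
    return R, C
```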
3.2.4. Predictive CFR+ (PCFR+) and PDCFR
Predictive CFR+ (Farina et al., 2021) is a state-of-the-art CFR variant based on predictive Blackwell approachability. It also uses a fixed quadratic discounting scheme for the average strategy, similar to the $\gamma$ term in DCFR (corresponding to $\gamma = 2$). PDCFR is a combination of PCFR+ and DCFR, also using fixed discounting.
3.2.5. Greedy Weights
Greedy Weights (Zhang et al., 2022) is a recently proposed algorithm for normal-form games that greedily weighs iterates based on observed regrets. While adaptable to extensive-form games, the paper finds its performance inferior to DDCFR in that domain.
3.3. Technological Evolution
The field of solving imperfect-information games has seen a steady progression in CFR algorithms:
- Vanilla CFR (2007): Established the foundational iterative regret minimization approach with uniform weighting of iterations. Provided theoretical convergence guarantees.
- CFR+ (2014): Introduced crucial practical improvements, notably linear discounting and regret resetting, drastically speeding up convergence and enabling solutions for games like heads-up limit Texas Hold'em. This showed the power of non-uniform weighting.
- Discounted CFR (DCFR) (2019): Systematically investigated discounting schemes, generalizing the concept with hyperparameters. This demonstrated further significant convergence acceleration, but still relied on fixed, manually-tuned parameters.
- Predictive CFR+ (PCFR+) (2021): Incorporated concepts from Blackwell approachability, offering another avenue for speedup, again with fixed discounting rules.
- Dynamic Discounted CFR (DDCFR) (Current Paper): This work represents the next evolutionary step by addressing the primary limitation of prior discounting variants: their fixed, manually-specified nature.
DDCFR introduces a dynamic, automatically-learned discounting scheme, effectively turning the hyperparameter optimization problem into a policy optimization problem within an MDP, thus making the CFR process adaptive and potentially more robust and efficient across diverse game types.
3.4. Differentiation Analysis
The core difference and innovation of DDCFR compared to previous CFR variants like CFR+ and DCFR lie in the dynamic and automatically-learned nature of its discounting scheme.
- Fixed vs. Dynamic Discounting: CFR+ uses a fixed linear discounting scheme. DCFR uses a family of fixed discounting schemes determined by manually chosen hyperparameters ($\alpha$, $\beta$, $\gamma$). While these parameters allow for different schemes, the chosen values are static throughout the game and across different games. DDCFR (and its framework applied to PCFR+ and PDCFR), in contrast, learns a policy that dynamically adjusts the discounting weights ($\alpha_t$, $\beta_t$, $\gamma_t$) on the fly at each iteration $t$. This adjustment is based on runtime information about the CFR process, such as the current iteration number and exploitability.
- Manual vs. Automatic Learning: Previous variants require manual specification or empirical tuning of their discounting parameters (e.g., DCFR's values were found empirically). This process is unscalable, suboptimal, and requires domain expertise for each new game. DDCFR automatically learns its scheme by formulating the problem as policy optimization within a Markov Decision Process (MDP); an agent (a neural network) learns to output the discounting parameters based on the observed state.
- Generalization vs. Game-Specific Tuning: Manually tuned schemes are often optimized for specific games or a limited set of games and may not generalize well to unseen games. DDCFR is explicitly designed for generalization: by training on a diverse set of games and using a game-agnostic state representation, the learned policy can adapt to new games without further modification or tuning.
- Optimization Approach: Traditional hyperparameter optimization methods (grid search, Bayesian optimization) find the best fixed values. DDCFR uses Evolution Strategies (ES) to train a neural network policy that outputs dynamic weights, a novel application of ES to directly learn an adaptive CFR component.

In essence, DDCFR provides a meta-learning approach to CFR, in which the algorithm itself learns how to best configure its iterative process rather than relying on fixed rules.
4. Methodology
4.1. Principles
The core idea behind Dynamic Discounted CFR (DDCFR) is to move beyond fixed, manually-specified discounting schemes in Counterfactual Regret Minimization (CFR) algorithms by learning a dynamic, adaptive scheme. The theoretical basis is that CFR's iterative process can be framed as a sequential decision-making problem, which is naturally modeled as a Markov Decision Process (MDP). Within this MDP, an intelligent agent (represented by a neural network) can learn a policy to choose optimal discounting weights at each iteration, maximizing the efficiency of convergence to a Nash equilibrium. The intuition is that different phases of the learning process (e.g., early vs. late iterations) or different game characteristics might benefit from different discounting strategies, and a dynamic scheme can adapt to these nuances.
4.2. Core Methodology In-depth (Layer by Layer)
The DDCFR framework encapsulates the CFR iteration process into an environment and models the discounting scheme as an agent interacting with it. This interaction constitutes an MDP.
4.2.1. The DDCFR Framework Overview
The agent (the dynamic discounting scheme) receives the current state of the CFR iteration process and outputs an action, which consists of the discounting weights to be used for a certain number of subsequent CFR iterations. This process repeats until a maximum number of iterations is reached. The agent's ultimate goal is to learn an optimal discounting policy that can dynamically select appropriate weights, thereby minimizing the exploitability of the obtained average strategies and finding an approximate Nash equilibrium promptly.
The DDCFR framework can be visualized as follows, illustrating the overall iterative process:

This schematic illustrates how the DDCFR framework operates: the CFR iteration process (with iteration normalization, states, actions, and the games in the environment) is wrapped as an environment, and color coding marks key steps such as the current strategy, instantaneous regrets, cumulative-regret updates, and regret matching, together with the reward feedback signal. The diagram shows how the discounting weights are adjusted dynamically to improve convergence speed and performance.
Figure 1: The DDCFR framework.
4.2.2. MDP Formalization
The interaction process between the agent and the CFR environment for a given game $G$ is formalized as a Markov Decision Process (MDP).
4.2.2.1. The Game
Each game acts as a distinct environment for the MDP. The ultimate objective is to train a discounting policy that not only performs well on the training games but also generalizes effectively to new, unseen games.
4.2.2.2. The State Space
The state $s_t^G$ represents the information observed by the agent at iteration $t$ within game $G$. To ensure the learned policy's generalization ability across different games, the state representation is designed to be game-agnostic. It consists of two key components:
- Normalized Iteration ($\hat{t}$): The ratio of the current iteration $t$ to the total number of iterations $T$, i.e., $\hat{t} = t / T$. This gives the agent a sense of progress within the overall CFR run.
- Normalized Exploitability ($\hat{E}_{t-1}^G$): A normalized logarithmic measure of the exploitability of the average strategy from the previous iteration $t-1$, computed as
$
\hat{E}_{t-1}^{G} = \frac{\log E_{t-1}^{G} - \log E_{\min}}{\log E_{1}^{G} - \log E_{\min}}
$
where $E_{t-1}^G$ is the exploitability of the average strategy in game $G$ at iteration $t-1$, $E_1^G$ is the exploitability at iteration 1, and $E_{\min}$ is a small positive constant serving as a lower bound on the reachable exploitability, which keeps the logarithm well-defined.

The logarithmic operation is crucial here because exploitability values in CFR algorithms often decrease rapidly in early iterations and then slow down significantly. Logarithmic normalization spreads the exploitability values more evenly across the state space, which facilitates the neural network agent's learning (see the short sketch below).
4.2.2.3. The Action Space
The agent's action at iteration $t$ is a vector of four numbers: $a_t = [\alpha_t, \beta_t, \gamma_t, \tau_t]$. These parameters directly determine the discounting weights for the subsequent CFR iterations, analogously to DCFR (Brown & Sandholm, 2019b). The discounting weights are applied as follows (a code sketch of one agent-environment step appears after this list):
- Positive Cumulative Regrets: The previous positive cumulative regrets are multiplied by $\frac{(t-1)^{\alpha_t}}{(t-1)^{\alpha_t}+1}$ before adding the instantaneous regret $r_i^t(I, a)$, yielding the update rule $ R_i^t(I, a) = R_i^{t-1}(I, a)\, \frac{(t-1)^{\alpha_t}}{(t-1)^{\alpha_t} + 1} + r_i^t(I, a) \quad \text{if } R_i^{t-1}(I, a) > 0. $ Here, $\alpha_t$ controls how aggressively positive regrets are discounted; a smaller $\alpha_t$ means stronger discounting of older regrets.
- Negative Cumulative Regrets: The previous negative cumulative regrets are multiplied by $\frac{(t-1)^{\beta_t}}{(t-1)^{\beta_t}+1}$ before adding the instantaneous regret, yielding $ R_i^t(I, a) = R_i^{t-1}(I, a)\, \frac{(t-1)^{\beta_t}}{(t-1)^{\beta_t} + 1} + r_i^t(I, a) \quad \text{otherwise}. $ Here, $\beta_t$ controls the discounting of negative regrets.
- Cumulative Strategy: The previous cumulative strategy is discounted by $\left(\frac{t-1}{t}\right)^{\gamma_t}$ before adding the current iteration's strategy contribution: $ C_i^t(I, a) = C_i^{t-1}(I, a) \left(\frac{t-1}{t}\right)^{\gamma_t} + \pi_i^{\sigma^t}(I)\, \sigma_i^t(I, a). $ Here, $\gamma_t$ controls how aggressively older strategies are discounted in the average strategy; a larger $\gamma_t$ means stronger discounting of older strategies.
- Duration ($\tau_t$): The number of subsequent CFR iterations for which the chosen discounting weights ($\alpha_t$, $\beta_t$, $\gamma_t$) are applied. Including $\tau_t$ lets the policy decide how frequently to adjust the hyperparameters.

The key distinction from DCFR is that DDCFR's $\alpha_t$, $\beta_t$, and $\gamma_t$ are adjusted dynamically at runtime based on the state $s_t$, rather than being fixed throughout the entire iteration process or across different games.
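The sketch below illustrates one agent-environment step in this design; `run_cfr_iteration` is a hypothetical helper standing in for a full DCFR-style tree traversal with the given exponents:

```python
# Illustrative sketch of one agent-environment step in DDCFR: the action
# (alpha_t, beta_t, gamma_t, tau_t) fixes the discounting weights for the
# next tau_t CFR iterations.
def apply_action(game, t, action, run_cfr_iteration):
    alpha_t, beta_t, gamma_t, tau_t = action
    for _ in range(int(tau_t)):
        # Hypothetical helper: one full CFR traversal using these exponents.
        run_cfr_iteration(game, t, alpha=alpha_t, beta=beta_t, gamma=gamma_t)
        t += 1
    return t  # the environment then emits the next state after tau_t iterations
```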
4.2.2.4. The State Transition
The state transition function describes how the CFR iteration process evolves. When the agent observes the current state $s_t$ and outputs an action $a_t$, DDCFR applies the calculated discounting weights for $\tau_t$ iterations. After these iterations, the state transitions to $s_{t+\tau_t}$. This allows the agent to control the granularity of its policy updates.
4.2.2.5. The Reward Function
The reward function evaluates the agent's performance and guides the learning process. DDCFR uses a sparse reward: the agent only receives a reward at the very end of the $T$ total CFR iterations. The reward is defined as:
$
\hat{R}^{G} = \log E_{1}^{G} - \log E_{T}^{G}
$
where:
- $E_1^G$ is the exploitability of the initial average strategy (at iteration 1) in game $G$.
- $E_T^G$ is the exploitability of the final average strategy (after the maximum number of iterations $T$) in game $G$.

The agent's objective is to maximize this reward, i.e., to achieve the largest reduction in (log) exploitability from the beginning to the end of the CFR process.
4.2.2.6. The Agent (Policy Network)
The agent, which represents the dynamic discounting scheme, is implemented as a neural network with parameters $\theta$. This network, denoted $\pi_{\theta}$, takes the current state $s_t$ as input and outputs the action $\hat{a}_t$:
$
[\alpha_t, \beta_t, \gamma_t, \tau_t] = \hat{a}_t = \pi_{\theta}(s_t)
$
The overall objective for DDCFR is to learn parameters $\theta$ that maximize the average final reward across all training games $\mathbb{G}$:
$
f(\theta) = \frac{1}{|\mathbb{G}|} \sum_{G \in \mathbb{G}} f^{G}(\theta)
$
where $f^G(\theta) = \hat{R}^G$ is the reward obtained in a specific game $G$.
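For concreteness, here is a PyTorch sketch of such a policy network, following the architecture described later in Section 5.4 (three 64-unit fully-connected layers with ELU activations). The output scaling ranges for $\beta_t$ and $\gamma_t$ and the set of duration options are illustrative assumptions, not the paper's exact values:

```python
import torch
import torch.nn as nn

TAU_OPTIONS = [1, 2, 5, 10]  # assumed set of candidate durations tau_t

class DiscountingPolicy(nn.Module):
    """Sketch of pi_theta: maps the 2-d state to (alpha, beta, gamma) plus a
    distribution over discrete duration options."""
    def __init__(self, state_dim=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.cont_head = nn.Linear(hidden, 3)                 # alpha, beta, gamma
        self.tau_head = nn.Linear(hidden, len(TAU_OPTIONS))   # logits over durations

    def forward(self, state):
        h = self.body(state)
        alpha, beta, gamma = torch.sigmoid(self.cont_head(h)).unbind(-1)
        alpha = 5.0 * alpha    # alpha_t in [0, 5] (Section 5.4)
        beta = -5.0 * beta     # assumed range for beta_t
        gamma = 5.0 * gamma    # assumed range for gamma_t
        tau_probs = torch.softmax(self.tau_head(h), dim=-1)
        return alpha, beta, gamma, tau_probs

policy = DiscountingPolicy()
a, b, g, tau_p = policy(torch.tensor([[0.5, 0.7]]))  # state = [t/T, normalized expl.]
```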
4.2.3. Theoretical Analysis
The paper provides a theoretical guarantee for DDCFR's convergence to a Nash equilibrium, provided the chosen hyperparameters stay within certain ranges.
Theorem 1. Suppose DDCFR is run for $T$ iterations in a two-player zero-sum game. If DDCFR keeps its hyperparameters within the prescribed ranges, i.e., $\alpha_t \in [0, 5]$ for the first half of the iterations and $\alpha_t \in [1, 5]$ thereafter, with $\beta_t$ and $\gamma_t$ restricted to bounded ranges (the precise ranges are given in the paper), then the weighted average strategy converges to a Nash equilibrium.
Explanation:
- This theorem ensures that DDCFR, even with dynamically chosen $\alpha_t$, $\beta_t$, $\gamma_t$, retains the theoretical convergence guarantees characteristic of CFR algorithms.
- The conditions on $\alpha_t$ (switching from $[0, 5]$ to $[1, 5]$ halfway through the iterations) and the bounded ranges for $\beta_t$ and $\gamma_t$ are crucial for the proof, which relies on bounding cumulative regrets and applying existing lemmas from the DCFR and CFR literature.
- The convergence rate is on the order of $O(1/\sqrt{T})$, which is typical for regret-minimization algorithms, scaled by game-specific factors such as the number of information sets, the maximum number of actions, and the range of utilities. The algorithm therefore approaches a Nash equilibrium at a speed governed by the square root of the number of iterations.
4.2.4. Optimization Through Evolution Strategies
Training the discounting policy presents several challenges:
- Sparse Reward: The agent only receives a reward at the very end of the CFR iteration process, making credit assignment over long horizons difficult for standard Reinforcement Learning (RL) algorithms.
- Long Time Horizons: CFR often requires thousands of iterations, leading to very long episodes in the MDP.
- Multi-task Learning: The agent needs to generalize across various games (multi-task learning), which can be difficult for RL methods that are sensitive to hyperparameter tuning.

To overcome these challenges, the authors choose Evolution Strategies (ES) (Salimans et al., 2017) instead of traditional RL algorithms such as DQN or PPO. ES is well-suited because it is:
- Black-box Optimization: It does not require gradient information, making it robust to sparse or noisy rewards.
- Scalable: Easy to parallelize across many CPU cores.
- Robust: Tolerant of long time horizons and less sensitive to hyperparameters.

The ES-based training algorithm estimates the gradient of the objective function $f$ (the average reward across games) by applying Gaussian perturbations to the network parameters $\theta$:
$
\nabla_{\theta}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} f(\theta + \delta \epsilon) = \frac{1}{\delta}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ f(\theta + \delta \epsilon)\, \epsilon \right]
$
where:
- $\theta$ are the agent's parameters.
- $\delta$ is the noise standard deviation, controlling the magnitude of the perturbations.
- $\epsilon$ is a random vector sampled from a standard Gaussian distribution.
- $f(\theta + \delta \epsilon)$ is the objective function (average reward) evaluated with the perturbed parameters.

This gradient is approximated by sampling a population of perturbed parameters, evaluating their performance, and then using these evaluations to update the original parameters via stochastic gradient ascent.
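A minimal NumPy sketch of this population-based gradient estimate (with a toy objective standing in for the CFR reward) looks as follows:

```python
import numpy as np

# Minimal sketch of the ES gradient estimate: perturb theta with Gaussian
# noise, evaluate the black-box objective f on every perturbed vector, and
# average the noise directions weighted by the measured fitness.
def es_gradient(f, theta, delta=0.5, population=100, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((population, theta.size))
    fitness = np.array([f(theta + delta * e) for e in eps])
    return (fitness[:, None] * eps).mean(axis=0) / delta

# Toy usage: maximize -||theta||^2. In DDCFR, f(theta) would instead run the
# CFR process on every training game and return the average reward.
theta = np.ones(4)
grad = es_gradient(lambda th: -np.sum(th ** 2), theta)
theta = theta + 0.01 * grad  # one step of stochastic gradient ascent
```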
The complete training procedure for DDCFR's discounting policy is outlined in Algorithm 1:
Algorithm 1: DDCFR's training procedure.
Input: Training games $\mathbb{G}$, epochs $E$, learning rate $lr$, population size $N$, standard deviation $\delta$
1 Randomly initialize the agent's parameters $\theta$;
2 for $e = 1$ to $E$ do
3   Initialize the population $P$ as an empty list;
4   for $k = 1$ to $N$ do
5     Sample $\epsilon_k \sim \mathcal{N}(0, I)$;
6     Add $\theta_k = \theta + \delta \epsilon_k$ to the population $P$;
7   foreach $\theta_k \in P$ do
8     foreach $G \in \mathbb{G}$ do
9       Compute $f^{G}(\theta_k)$ (Algorithm 2);
10    $f(\theta_k) \leftarrow \frac{1}{|\mathbb{G}|} \sum_{G \in \mathbb{G}} f^{G}(\theta_k)$;
11  Compute the gradient estimate using Eqn. 5;
12  Update $\theta$ by gradient ascent with learning rate $lr$;
Output: The trained agent $\pi_{\theta}$
The calculation of $f^{G}(\theta)$ (the reward for a single game) is detailed in Algorithm 2:
Algorithm 2: The calculation process of $f^{G}(\theta)$.
Input: Training game $G$, agent $\pi_{\theta}$, total iterations $T$
1 $t \leftarrow 1$;
2 while $t \leq T$ do
3   $[\alpha_t, \beta_t, \gamma_t, \tau_t] \leftarrow \pi_{\theta}(s_t^G)$;
4   Use $[\alpha_t, \beta_t, \gamma_t]$ to calculate the discounting weights and perform $\tau_t$ CFR iterations;
5   $t \leftarrow t + \tau_t$;
6 Compute the exploitability $E_T^G$; Output: $f^{G}(\theta) = \log E_1^G - \log E_T^G$
Three acceleration techniques are employed to improve training efficiency:
- Antithetic Estimator: This technique reduces the variance of the gradient estimate by sampling perturbations in mirrored pairs ($+\epsilon$ and $-\epsilon$). The gradient is computed from the symmetric difference: $ \nabla_{\theta}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} f(\theta + \delta \epsilon) = \frac{1}{2\delta}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)} \left[ f(\theta + \delta \epsilon) - f(\theta - \delta \epsilon) \right] \epsilon. $ This yields an unbiased estimator with lower variance than using only positive perturbations.
- Fitness Shaping: A rank transformation is applied to the objective values before computing the gradient. For the parameter vector $\theta_k$ with the $k$-th largest value in a population of size $N$, the shaped result is: $ f'(\theta_k) = \frac{\max\left(0, \log\left(\frac{N}{2}+1\right) - \log(k)\right)}{\sum_{j=1}^{N} \max\left(0, \log\left(\frac{N}{2}+1\right) - \log(j)\right)} - \frac{1}{N}. $ This makes the gradient depend on the relative ordering of fitness values rather than their absolute magnitudes, increasing robustness to variations in reward scale (a brief sketch of both tricks follows this list).
- Parallel Evaluation: The performance of each perturbed parameter set is evaluated under each training game in parallel across multiple CPU cores, significantly speeding up the overall training process.
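The sketch below illustrates the first two tricks, antithetic (mirrored) sampling and rank-based fitness shaping, in NumPy; it is a simplified rendering of the formulas above rather than the paper's implementation:

```python
import numpy as np

# Antithetic sampling: evaluate each perturbation and its mirror image, then
# use the symmetric difference of the two fitness values.
def antithetic_gradient(f, theta, delta=0.5, half_pop=50, seed=0):
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((half_pop, theta.size))
    diffs = np.array([f(theta + delta * e) - f(theta - delta * e) for e in eps])
    return (diffs[:, None] * eps).mean(axis=0) / (2.0 * delta)

# Rank-based fitness shaping: the shaped value depends only on each reward's
# rank within the population, not on its absolute magnitude.
def fitness_shaping(rewards):
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    ranks = np.empty(n, dtype=int)
    ranks[np.argsort(-rewards)] = np.arange(1, n + 1)  # rank 1 = largest reward
    raw = np.maximum(0.0, np.log(n / 2 + 1) - np.log(ranks))
    return raw / raw.sum() - 1.0 / n
```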
4.2.5. Additional Computation Cost Analysis
DDCFR introduces minor additional computation costs compared to DCFR, primarily in three areas:
- Feature Calculations: The state representation requires the current iteration number (trivially available) and the normalized exploitability. The exploitability calculation can be performed in parallel with the instantaneous-regret calculation, because exploitability depends on the average strategy ($\bar{\sigma}^{t-1}$) from the previous iteration while the instantaneous regret depends on the current strategy ($\sigma^t$). Executing the two computations concurrently via multiprocessing adds effectively no wall-clock overhead.
- Network Inference: The time needed for the policy network to infer the discounting weights is negligible compared to the total CFR iteration time. Experimental results in Appendix D show that DDCFR incurs only a small average increase in running time compared to DCFR (roughly 0.2%; see Table 2 below).
- Policy Training: The training cost of the discounting policy is amortized. Once trained, the policy can be applied directly to numerous new games without further training or modification, making the one-time training effort worthwhile across a wide range of applications.
5. Experimental Setup
5.1. Datasets
The experiments use a selection of commonly used imperfect-information games (IIGs) from the research community. These games vary in complexity and size, providing a diverse testbed for evaluating DDCFR's performance and generalization ability.
Training Games:
The training games are chosen to be relatively small to speed up the training process while still exhibiting typical characteristics found in larger IIGs.
- Kuhn Poker (Kuhn, 1950): A simplified poker game with a 3-card deck (J, Q, K), one bet chance per player.
- Goofspiel-3 (Ross, 1971): A bidding card game with 3 cards, where players bid on revealed point cards. This is an imperfect information variant where players only know if they won/lost the bid, not the opponent's card.
- Liar's Dice-3 (Lisy et al., 2015): A dice game where players roll a 3-sided die secretly and bid on the dice outcomes.
- Small Matrix: A custom game where player 1 has five actions (rock, paper, scissors, A1, A2), and player 2 has three (rock, paper, scissors). It's a modified rock-paper-scissors where A1/A2 actions by player 1 result in large losses, simulating situations with high-loss actions.
Testing Games:
Eight relatively large and challenging IIGs are used to evaluate the generalization performance of the learned discounting scheme on unseen games.
- Battleship-2 (Farina et al., 2019): A board game where players secretly place a ship on a grid and take turns firing.
- Battleship-3 (Farina et al., 2019): A larger version of Battleship played on a bigger grid.
- Goofspiel-4 (Ross, 1971): A larger version of Goofspiel with 4 cards.
- Liar's Dice-4 (Lisy et al., 2015): A larger version of Liar's Dice with 4-sided dice.
- Leduc Poker (Southey et al., 2005): A larger poker game with a 6-card deck (two suits, three ranks) and two betting rounds, with one public card revealed after the first round.
- Big Leduc Poker: A more complex version of Leduc Poker with a 24-card deck (twelve ranks) and allowing up to six raises per round.
- HUNL Subgame-3 (Brown & Sandholm, 2018): A subgame of Heads-Up No-Limit Texas Hold'em (HUNL) starting at the final betting round with $500 in the pot.
- HUNL Subgame-4 (Brown & Sandholm, 2018): Another HUNL subgame starting at the final betting round with $3,750 in the pot.
The choice of these games, both for training and testing, ensures a strong diversity in rules, scales, and characteristics, which is crucial for testing the generalization ability of the learned policy.
The sizes of the games used are reported in Table 4 of the original paper:
| Game | #Histories | #Infosets | #Terminal histories | Depth | Max size of infosets |
| --- | --- | --- | --- | --- | --- |
| Small Matrix | 21 | 2 | 15 | 3 | 5 |
| Kuhn Poker | 58 | 12 | 30 | 6 | 2 |
| Goofspiel-3 | 67 | 16 | 36 | 5 | 4 |
| Liar's Dice-3 | 1,147 | 192 | 567 | 10 | 3 |
| Battleship-2 | 10,069 | 3,286 | 5,568 | 9 | 4 |
| Battleship-3 | 732,607 | 81,027 | 552,132 | 9 | 7 |
| Goofspiel-4 | 1,077 | 162 | 576 | 7 | 14 |
| Liar's Dice-4 | 8,181 | 1,024 | 4,080 | 12 | 4 |
| Leduc Poker | 9,457 | 936 | 5,520 | 12 | 5 |
| Big Leduc Poker | 6,178,561 | 100,800 | 3,953,424 | 20 | 23 |
| HUNL Subgame-3 | 398,112,843 | 69,184 | 261,126,360 | 10 | 1,980 |
| HUNL Subgame-4 | 244,005,483 | 43,240 | 158,388,120 | 8 | 1,980 |
5.2. Evaluation Metrics
The primary evaluation metric used in the paper is exploitability.
5.2.1. Exploitability ($e(\bar{\sigma})$)
- Conceptual Definition: Exploitability quantifies how far a given strategy profile (a set of strategies for all players) is from a Nash equilibrium. A strategy profile with zero exploitability is a Nash equilibrium, meaning no player can improve their outcome by unilaterally changing their strategy. A lower exploitability value indicates a closer approximation to a Nash equilibrium. For two-player zero-sum games, exploitability is often defined as twice the maximum utility one player can gain by playing a best response against the other player's fixed strategy; the paper defines the exploitability of a strategy profile as the average exploitability across all players. It directly measures the approximation error to the Nash equilibrium.
- Mathematical Formula: The exploitability of a strategy profile $\bar{\sigma}$ is $ e(\bar{\sigma}) = \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} e_i(\bar{\sigma}_i) $, where $e_i(\bar{\sigma}_i)$ is the exploitability of player $i$'s strategy. For a single player $i$, the exploitability of their strategy $\bar{\sigma}_i$ against the opponents' strategies $\bar{\sigma}_{-i}$ is $ e_i(\bar{\sigma}_i) = u_i(\mathrm{BR}(\bar{\sigma}_{-i}), \bar{\sigma}_{-i}) - u_i(\bar{\sigma}_i, \bar{\sigma}_{-i}) $.
- Symbol Explanation:
  - $e(\bar{\sigma})$: The exploitability of the strategy profile $\bar{\sigma}$.
  - $|\mathcal{N}|$: The number of players in the game (for two-player games, $|\mathcal{N}| = 2$).
  - $e_i(\bar{\sigma}_i)$: The exploitability of player $i$'s strategy $\bar{\sigma}_i$.
  - $u_i$: The expected utility function for player $i$.
  - $\mathrm{BR}(\bar{\sigma}_{-i})$: The best response strategy for player $i$ against the combined strategies of all other players, i.e., the strategy that maximizes player $i$'s expected utility given $\bar{\sigma}_{-i}$.
  - $\bar{\sigma}_i$: Player $i$'s current average strategy.
  - $\bar{\sigma}_{-i}$: The average strategies of all players except player $i$.

The reward function for DDCFR's training also relies on exploitability, specifically the reduction in logarithmic exploitability: $\hat{R}^G = \log E_1^G - \log E_T^G$.
5.3. Baselines
The paper compares DDCFR against several strong existing CFR variants:
- CFR+ (Tammelin, 2014): A milestone CFR variant known for its linear discounting scheme, regret resetting, and alternating updates, significantly faster than vanilla CFR.
- DCFR (Brown & Sandholm, 2019b): A family of algorithms that systematically discounts cumulative regrets and strategies using fixed hyperparameters ($\alpha$, $\beta$, $\gamma$); considered one of the most efficient equilibrium-finding algorithms.
- Predictive CFR+ (PCFR+) (Farina et al., 2021): A state-of-the-art CFR variant based on predictive Blackwell approachability, which uses a fixed quadratic discounting scheme.
- PDCFR: An algorithm that combines PCFR+ and DCFR, also using fixed discounting schemes.
- Greedy Weights (Zhang et al., 2022): A dynamic discounted regret-minimization algorithm originally designed for normal-form games, adapted here to extensive-form games for comparison.
5.4. Training Details
- Noise Standard Deviation ($\delta$): Fixed at 0.5 for Evolution Strategies (ES).
- Population Size ($N$): Set to 100 for ES.
- Parallelization: The evaluation of perturbed parameters is distributed across 200 CPU cores to speed up training.
- Action Space Ranges:
  - $\alpha_t$: Continuous values in $[0, 5]$.
  - $\beta_t$, $\gamma_t$: Continuous values within bounded ranges selected, together with the $\alpha_t$ range, to satisfy the convergence conditions of Theorem 1.
  - $\tau_t$: Discrete values from a small predefined set, representing the number of iterations for which the chosen discounting weights are applied.
- Policy Network Architecture ($\pi_{\theta}$): A neural network with three fully-connected layers of 64 units each and ELU activations. The network takes the current state ($\hat{t}$, $\hat{E}_{t-1}^G$) as input and outputs values for $\alpha_t$, $\beta_t$, $\gamma_t$ and a probability distribution over the discrete values of $\tau_t$.
- Optimizer: Adam with a learning rate (lr) of 0.01.
- Training Epochs ($E$): The agent is trained for 1000 epochs.
- Training Duration: Approximately 24 hours.
- CFR Iterations ($T$): The total number of CFR iterations for each game within an epoch is set to 1000, which is typically sufficient for convergence to small exploitability values.
- CFR Variant Techniques: All CFR variants used in the experiments (including DDCFR and the baselines) use the alternating-updates technique, in which players update their strategies one after another.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that DDCFR consistently achieves faster convergence and lower exploitability compared to CFR+ and DCFR, especially on testing games, highlighting its strong generalization ability.
6.1.1. Performance on Training Games
The performance comparison on training games (Small Matrix, Kuhn Poker, Goofspiel-3, Liar's Dice-3) shows that DDCFR converges faster to a lower exploitability. This is expected as these games were part of the optimization objective.
The following are the results from Figure 2 of the original paper:

Figure 2: Comparison of DDCFR against two discounting CFR variants on four training games.
Analysis of Figure 2:
- Small Matrix: DDCFR shows the steepest decline in exploitability, reaching a significantly lower value than CFR+ and DCFR by 1000 iterations.
- Kuhn Poker: All three algorithms perform well, but DDCFR still achieves slightly better exploitability in the later iterations.
- Goofspiel-3: DDCFR clearly outperforms CFR+ and DCFR, showing a faster convergence trajectory and a lower final exploitability.
- Liar's Dice-3: Similar to Goofspiel-3, DDCFR demonstrates superior performance, indicating that its learned scheme is effective even in more complex training environments.

In summary, DDCFR successfully optimizes its performance on the training games, validating the effectiveness of the ES-based learning process.
6.1.2. Performance on Testing Games
A more crucial validation of DDCFR is its performance on unseen testing games, demonstrating its generalization ability. DDCFR maintains competitive performance and often outperforms CFR+ and DCFR.
The following are the results from Figure 3 of the original paper:

Figure 3: Comparison of DDCFR against two discounting CFR variants on eight testing games.
Analysis of Figure 3:
- Battleship-2, Battleship-3, Goofspiel-4, Liar's Dice-4, Leduc Poker, Big Leduc Poker, HUNL Subgame-3, HUNL Subgame-4: Across all eight testing games, DDCFR consistently achieves lower exploitability than CFR+ and DCFR in the later iterations (500-1000).
- Significant Improvement: The paper reports that DDCFR yields a substantial average reduction in exploitability compared to DCFR on these testing games, a remarkable improvement given DCFR's already strong performance.
- Generalization: The consistent outperformance on unseen games strongly supports the claim that DDCFR's dynamic discounting scheme has robust generalization ability. This is attributed to the careful design of the training games, the game-agnostic state space, and the scalable ES training algorithm.
6.1.3. Overall Iteration Performance
For completeness, the paper includes figures showing all iterations (1 to 1000) for training and testing games in the appendices.
The following are the results from Figure 5 of the original paper:

Figure 5: All iterations of DDCFR and discounting CFR variants on four training games. This figure provides the full exploitability curves for the training games, confirming that the performance differences are visible across the entire iteration range, not just the later stages.
The following are the results from Figure 6 of the original paper:

Figure 6: All iterations of DDCFR and discounting CFR variants on eight testing games.
Similarly, this figure shows the full exploitability curves for the testing games, further reinforcing DDCFR's superior convergence behavior.
6.1.4. Comparison with Greedy Weights
The paper also compares DDCFR with Greedy Weights, an algorithm primarily designed for normal-form games.
The following are the results from Figure 7 of the original paper:

Figure 7: Performance comparison between DDCFR and GreedyWeights on extensive-form games.
Analysis of Figure 7:
DDCFRconsistently converges faster to a lower exploitability across all four extensive-form games (Battleship-2, Liar's Dice-4, Kuhn Poker, Leduc Poker).- This suggests that while
Greedy Weightsis promising for normal-form games, its approach might not be as effective in the more complexextensive-form gameswith a large number of information sets, whereDDCFR's explicit design for this domain yields better results.
6.1.5. Comparison and Combination with Predictive CFR+ (PCFR+)
The DDCFR framework is designed to be general and can be applied to other CFR variants. The paper illustrates this by applying it to PCFR+.
The following are the results from Figure 8 of the original paper:

Figure 8: Performance comparison between DDCFR and PCFR+ on four testing games.
Analysis of Figure 8:
- PCFR+ is faster on non-poker games (Goofspiel-4, Liar's Dice-4), while DDCFR (built on DCFR) outperforms it on poker games (Leduc Poker, Big Leduc Poker). This aligns with findings reported in the PCFR+ paper itself.
- However, DDCFR significantly narrows the gap with PCFR+ on non-poker games and further amplifies its advantage on poker games, showing the benefit of its dynamic scheme.
6.1.6. Dynamic Predictive CFR+ (DPCFR+)
Applying the DDCFR framework to PCFR+ (which uses a fixed quadratic discounting scheme) results in Dynamic Predictive CFR+ (DPCFR+).
The following are the results from Figure 9 of the original paper:

Figure 9: Performance comparison between DPCFR+ and PCFR+ on four testing games.
Analysis of Figure 9:
- DPCFR+ further unlocks the potential of PCFR+ on non-poker games (Goofspiel-4, Liar's Dice-4), achieving lower exploitability.
- Crucially, it does so without sacrificing performance on poker games (Leduc Poker, Big Leduc Poker). This demonstrates the plug-and-play nature and general applicability of the dynamic discounting framework as a performance booster.
6.1.7. Predictive Dynamic Discounted CFR (PDDCFR)
The framework is also applied to PDCFR (a combination of PCFR+ and DCFR), resulting in Predictive Dynamic Discounted CFR (PDDCFR).
The following are the results from Figure 10 of the original paper:

Figure 10: Performance comparison between PDDCFR and PDCFR on four testing games.
Analysis of Figure 10:
- PDDCFR significantly enhances PDCFR's performance across both non-poker and poker games (Leduc Poker, Big Leduc Poker, Goofspiel-4, Liar's Dice-4).
- This again underscores the generality of the dynamic discounting framework as a means of improving various
6.1.8. Visualization of Learned Discounting Scheme
To understand the behavior of the learned policy $\pi_{\theta}$, its actions ($\alpha_t$, $\beta_t$, $\gamma_t$, $\tau_t$) are visualized during the iteration process.
The following are the results from Figure 4 of the original paper:

Figure 4: The learned discounting scheme behaves differently in various games yet exhibits a similar trend. Compared with DCFR's fixed discounting scheme (we can view DCFR as a special case of DDCFR with $\alpha_t = 1.5$, $\beta_t = 0$, $\gamma_t = 2$, and duration $\tau_t = 1$), it is more aggressive in the earlier iterations and becomes more moderate as the iteration progresses.
Analysis of Figure 4:
- Dynamic Adjustment: The actions ($\alpha_t$, $\beta_t$, $\gamma_t$, $\tau_t$) vary across different games and change dynamically over iterations, confirming the adaptive nature of the learned scheme.
- General Trends: Despite game-specific variations, a similar overall trend is observed:
  - $\alpha_t$ (positive-regret discount) generally increases with iterations. A smaller $\alpha_t$ in earlier iterations indicates more aggressive discounting of positive regrets; as iterations progress, $\alpha_t$ increases and the discounting becomes more moderate.
  - $\beta_t$ (negative-regret discount) generally decreases with iterations. The scheme consistently uses more aggressive discounting of negative cumulative regrets (a smaller $\beta_t$ means stronger discounting), which is beneficial for rapidly reusing promising actions.
  - $\gamma_t$ (average-strategy discount) generally decreases with iterations. A larger $\gamma_t$ in earlier iterations signifies more aggressive discounting of the cumulative strategy, giving more weight to recent strategies; as iterations progress, $\gamma_t$ decreases, leading to more moderate averaging.
  - $\tau_t$ (duration of the weights) suggests that the discounting weights should be adjusted more frequently in earlier iterations (smaller $\tau_t$) and less frequently later on (larger $\tau_t$).
- Aggressive Early, Moderate Late: Compared to DCFR's fixed scheme, the learned scheme is generally more aggressive in earlier iterations (smaller $\alpha_t$, larger $\gamma_t$). This is likely because optimal strategies change frequently in early iterations, so heavily discounting older information helps; as learning progresses and strategies stabilize, the scheme becomes more moderate, aiding convergence.
6.2. Ablation Studies / Parameter Analysis
Ablation studies were conducted to assess the effectiveness of DDCFR's various components. Results are reported in Table 1, showing the exploitability on three testing games (Leduc Poker, HUNL Subgame-3, HUNL Subgame-4).
The following are the results from Table 1 of the original paper:
| Game | DDCFR-1 | DDCFR-2 | DDCFR-3 | DDCFR-it | DDCFR-ex | DDCFR-dis | DDCFR-rl | DDCFR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Leduc Poker | 1.566e-4 | 1.490e-4 | 1.213e-4 | 1.249e-4 | 1.289e-4 | 1.156e-4 | 1.229e-4 | 1.198e-4 |
| Subgame-3 | 2.533e-3 | 2.481e-3 | 2.393e-3 | 2.292e-3 | 5.320e-3 | 5.795e-3 | 4.987e-3 | 1.963e-3 |
| Subgame-4 | 2.517e-3 | 2.175e-3 | 2.122e-3 | 2.714e-3 | 3.804e-3 | 5.803e-3 | 5.198e-3 | 2.005e-3 |
Analysis of Table 1:
- Number of Training Games (DDCFR-x):
  - DDCFR-1 (trained on one game) performs relatively poorly, indicating overfitting. DDCFR-2 and DDCFR-3 show improved results, as training on multiple games necessitates balancing performance across them and enhances generalization.
  - The full DDCFR (trained on four games) achieves the best performance. Notably, adding "Small Matrix" to the training set significantly improves performance on HUNL Subgame-3 and HUNL Subgame-4, possibly because both games involve high-loss actions, a property captured by Small Matrix. This confirms the importance of diverse training games for generalization.
- State Representation (DDCFR-it, DDCFR-ex):
  - DDCFR-it (using only the normalized iteration) performs better than DDCFR-ex (using only the normalized exploitability). This suggests that the iteration number provides more critical decision-making information to the agent than exploitability alone.
  - However, both DDCFR-it and DDCFR-ex perform worse than the full DDCFR, indicating that the normalized iteration and normalized exploitability are complementary features, and using both together gives the best performance.
- Action Space Design (DDCFR-dis): DDCFR-dis uses a discrete action space for the discounting parameters (discretizing their ranges to integers), while DDCFR uses a continuous action space. DDCFR-dis performs significantly worse than DDCFR, indicating that continuous actions enable the learned dynamic discounting scheme to exercise more fine-grained control over the discounting weights, leading to better performance (see the interface sketch after this list).
- Optimization Algorithm (DDCFR-rl): DDCFR-rl uses PPO (Proximal Policy Optimization), a common reinforcement learning (RL) algorithm, to optimize the policy. DDCFR-rl underperforms DDCFR (which uses ES) across all testing games. This validates the authors' choice of ES given the challenges of sparse rewards, long time horizons, and multi-task learning, which typical RL algorithms like PPO struggle with in this context.
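To make the ablation dimensions concrete (the two state features, the continuous discounting actions, and the update interval τ), here is a minimal, self-contained Python sketch of how a discounting policy could be wired into a CFR-style loop. The hand-written `toy_policy`, the normalization constants, and the synthetic exploitability curve are illustrative assumptions, not the authors' implementation; the real policy is a small network trained with ES.

```python
import math
import numpy as np

E_MIN = 1e-12   # assumed exploitability floor used for log-normalization (illustrative)
T_MAX = 1000    # assumed total number of CFR iterations (illustrative)

def normalized_state(t, e_t, e_1):
    """Game-agnostic state: normalized iteration plus log-normalized exploitability."""
    norm_t = t / T_MAX
    norm_e = (math.log(e_t) - math.log(E_MIN)) / (math.log(e_1) - math.log(E_MIN))
    return np.array([norm_t, norm_e], dtype=np.float32)

def toy_policy(state):
    """Stand-in for the learned policy network.

    Maps the 2-D state to continuous (alpha, beta, gamma) and an update interval
    tau. The hand-written rule below only mimics the qualitative trend described
    above (aggressive early, moderate late); the real policy's weights are
    trained with ES.
    """
    norm_t = float(state[0])
    alpha = 1.0 + norm_t           # grows -> milder positive-regret discounting later
    beta = -1.0                    # strong discounting of negative regrets throughout
    gamma = 4.0 - 2.0 * norm_t     # shrinks -> milder average-strategy discounting later
    tau = 1 if norm_t < 0.1 else 10
    return alpha, beta, gamma, tau

# Drive the policy with a synthetic exploitability curve just to show the interface;
# a real implementation would run `tau` CFR iterations with each chosen weight set.
e_1 = e_t = 0.5
t = 1
while t <= T_MAX:
    alpha, beta, gamma, tau = toy_policy(normalized_state(t, e_t, e_1))
    t += tau
    e_t = max(e_1 / t ** 1.5, E_MIN)   # pretend exploitability decays like O(1/T^1.5)
print("final parameters:", alpha, beta, gamma, tau)
```

The point of the sketch is the interface: the policy is queried only every τ iterations, its inputs are game-agnostic, and its outputs directly parameterize the discounting weights shown earlier.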
6.2.1. Running Time Comparison
The paper provides a comparison of running times between DCFR and DDCFR to quantify the computational overhead introduced by the dynamic discounting.
The following are the results from Table 2 of the original paper:
| Game | DCFR time | DDCFR time | (DDCFR time − DCFR time) / DCFR time |
| --- | --- | --- | --- |
| Battleship-2 | 0:14:54.558 | 0:14:55.814 | 0.140% |
| Battleship-3 | 11:10:46.895 | 11:12:05.156 | 0.194% |
| Goofspiel-4 | 0:01:03.259 | 0:01:03.518 | 0.409% |
| Liar's Dice-4 | 0:06:57.896 | 0:06:58.577 | 0.163% |
| Leduc Poker | 0:08:13.140 | 0:08:13.921 | 0.158% |
| Big Leduc Poker | 14:55:48.347 | 14:58:07.357 | 0.259% |
| HUNL Subgame-3 | 0:02:54.634 | 0:02:55.024 | 0.223% |
| HUNL Subgame-4 | 0:02:14.379 | 0:02:14.648 | 0.200% |
Analysis of Table 2:
- The table shows that DDCFR introduces a very small increase in running time compared to DCFR; the average increase across the listed games is only about 0.22%.
- This negligible overhead confirms the claim that the feature calculations (exploitability) and the policy network's inference add minimal burden to the overall CFR iteration time, especially since the exploitability calculation is executed in parallel. The bulk of the computational cost remains in the CFR tree traversal itself.
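As a quick sanity check (a small sketch, simply averaging the last column of Table 2), the overall overhead works out to roughly 0.22%:

```python
# Mean relative running-time increase over the games listed in Table 2.
increases = [0.140, 0.194, 0.409, 0.163, 0.158, 0.259, 0.223, 0.200]  # percent
print(f"average overhead: {sum(increases) / len(increases):.3f}%")     # ~0.218%
```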
6.2.2. Logarithmic Normalization for Exploitability
The paper justifies the use of logarithmic normalization for exploitability in the state representation.
The following are the results from Table 3 of the original paper:
| t | 1 | 10 | 20 | 100 | 500 | 1000 |
| --- | --- | --- | --- | --- | --- | --- |
| E_t | 4.583e-1 | 3.302e-2 | 1.599e-2 | 1.767e-3 | 1.699e-4 | 4.021e-5 |
| (log E_t − log E_min) / (log E_1 − log E_min) | 1.000 | 0.902 | 0.875 | 0.793 | 0.706 | 0.652 |
| (√E_t − √E_min) / (√E_1 − √E_min) | 1.000 | 0.268 | 0.187 | 0.062 | 0.019 | 0.009 |
| (E_t − E_min) / (E_1 − E_min) | 1.000 | 0.072 | 0.034 | 0.004 | 0.000 | 0.000 |
Analysis of Table 3:
- Raw Exploitability (E_t): The raw exploitability values decrease very rapidly at first and then slow down significantly. A linear scale would compress most of the meaningful changes into the very early iterations.
- Logarithmic Normalization: (log E_t − log E_min) / (log E_1 − log E_min) results in a more evenly distributed range of values (e.g., from 1.000 down to 0.652). This provides the neural network agent with more discriminative information across different stages of CFR convergence.
- Other Normalizations: Square-root normalization and linear normalization (E_t − E_min) / (E_1 − E_min) compress the values much more drastically, especially in later iterations, where they quickly approach zero. This would make it harder for the agent to differentiate between states. The logarithmic normalization effectively spreads out the exploitability values, making it easier for the policy network to learn meaningful distinctions and make better decisions throughout the CFR process.
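The effect is easy to check numerically. The short script below is an illustrative sketch that recomputes the three normalizations for the exploitability samples in Table 3; it assumes a lower bound of E_min = 1e-12, a value chosen because it reproduces the tabulated numbers.

```python
import math

# Exploitability samples from Table 3.
E = {1: 4.583e-1, 10: 3.302e-2, 20: 1.599e-2,
     100: 1.767e-3, 500: 1.699e-4, 1000: 4.021e-5}
E_MIN = 1e-12   # assumed lower bound; chosen because it reproduces the tabulated values
E_1 = E[1]

def log_norm(e):   # stays spread out across the whole run
    return (math.log(e) - math.log(E_MIN)) / (math.log(E_1) - math.log(E_MIN))

def sqrt_norm(e):  # collapses toward zero in later iterations
    return (math.sqrt(e) - math.sqrt(E_MIN)) / (math.sqrt(E_1) - math.sqrt(E_MIN))

def lin_norm(e):   # collapses even faster
    return (e - E_MIN) / (E_1 - E_MIN)

for t, e in E.items():
    print(f"t={t:5d}  log={log_norm(e):.3f}  sqrt={sqrt_norm(e):.3f}  lin={lin_norm(e):.3f}")
# Expected output matches Table 3: the log column goes 1.000 -> 0.652,
# while the sqrt and linear columns quickly approach zero.
```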
6.3. Deterministic Nature of DDCFR's Iteration Process
The paper clarifies that once a discounting policy has been learned with ES (a stochastic training process), the subsequent application of DDCFR using that policy is entirely deterministic. When testing the learned policy on a game, DDCFR always initializes with a uniform random average strategy, and the policy network produces identical actions for identical inputs. Each run of DDCFR with the same learned model on the same game therefore generates an identical exploitability curve.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Dynamic Discounted CFR (DDCFR), a pioneering framework for solving imperfect-information games with an automatically-learned, dynamic discounting scheme. By formalizing the CFR iteration process as a Markov decision process (MDP) and framing discounting-scheme learning as a policy optimization problem, DDCFR moves beyond the limitations of the fixed, manually-specified discounting rules in previous CFR variants. The use of Evolution Strategies (ES) for training the policy, combined with a game-agnostic state representation, enables the learned scheme to adapt the discounting weights dynamically at runtime. Extensive experiments confirm that DDCFR achieves faster convergence and improved performance, demonstrating strong generalization across both training and unseen testing games, including complex ones such as the HUNL subgames. The framework's general applicability is further showcased by its ability to enhance other state-of-the-art CFR variants such as PDCFR.
7.2. Limitations & Future Work
The paper does not explicitly list a "Limitations" section, but some can be inferred:
- Computational Cost of Training: While the inference time for the learned policy is negligible, the ES-based training process itself, requiring evaluation across multiple games and numerous perturbed parameters for many epochs, can be computationally intensive (e.g., 24 hours on 200 CPU cores). However, the authors argue this cost is amortized over the reusability of the trained policy across many games.
- Scope: The current work focuses on two-player zero-sum IIGs. Extending it to multiplayer or general-sum games could introduce new complexities in reward definition, state representation, and policy learning.
- Exploitability Calculation Overhead: While the paper states that the exploitability calculation is parallelized to avoid wall-clock overhead during CFR execution, the exploitability calculation itself (which requires best-response computation) is often one of the most expensive parts of evaluating CFR algorithms, especially in large games. This cost is inherent to the reward function and state representation.

Potential future research directions implied or suggested by the paper include:

- Extending the Framework: Applying the dynamic discounting framework to a wider range of CFR variants or even other iterative game-solving algorithms.
- More Complex State Representations: Exploring richer state features that might capture more nuanced aspects of the CFR iteration process or game characteristics, potentially leading to even more sophisticated adaptive policies.
- Online Learning: Investigating methods for online adaptation or fine-tuning of the discounting policy without requiring full ES retraining.
- Multiplayer/General-Sum Games: Adapting the DDCFR framework to non-zero-sum or multiplayer IIGs, which present distinct challenges in defining equilibrium and regret.
7.3. Personal Insights & Critique
This paper presents a highly insightful and impactful contribution to the field of imperfect-information game solving. The idea of dynamically learning the hyperparameters of CFR algorithms, rather than relying on fixed, manually-tuned values, is a logical and powerful extension of previous work on discounted CFR variants.
Inspirations:
- Meta-Learning for Algorithm Design: The approach effectively acts as a meta-learning framework that learns to "tune" the CFR algorithm itself. This concept could be transferred to optimizing hyperparameters or algorithmic choices in other iterative optimization or learning algorithms, especially those with long execution times and complex dynamics.
- MDP Formulation for Abstract Processes: Framing the CFR iteration as an MDP is elegant. This shows how complex, non-standard optimization processes can be effectively modeled for RL or ES optimization, even when direct gradient computation is difficult.
- Effectiveness of ES for Challenging RL Problems: The successful application of Evolution Strategies to a problem with sparse rewards, long horizons, and multi-task requirements reinforces ES as a strong alternative to gradient-based RL in such scenarios. This could inspire more researchers to consider ES for similar meta-learning or hyperparameter optimization tasks.
- Generalizability as a Key Metric: The strong emphasis on generalization and the design choices made to achieve it (a game-agnostic state, a diverse training set) are commendable. This highlights that for algorithms intended for broad application, learning an adaptive policy that generalizes is often more valuable than achieving peak performance on a single, manually-tuned instance.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Interpretability of the Policy: While Figure 4 provides some insight into the learned trends of α, β, γ, and τ, the exact "logic" or complex interplay of these parameters as a function of both iteration and exploitability can be hard to interpret fully. Understanding why the policy makes certain choices might require more advanced interpretability techniques.
- Sensitivity to Training Game Diversity: The ablation study showed that the number and type of training games significantly impact generalization. This implies that carefully curating a diverse and representative set of training games is crucial for DDCFR's success, which may still involve some manual effort or domain expertise. The "Small Matrix" game, designed to simulate high-loss actions, is a good example of how targeted training examples can enhance generalization for specific game characteristics.
- Scalability for Extremely Large Games: While the running-time overhead is small and ES is parallelizable, the base CFR algorithm itself (which requires game-tree traversals) can become prohibitively expensive for truly enormous games (e.g., full-scale no-limit Texas Hold'em). DDCFR improves the convergence rate, but the fundamental scalability limits of tree-based CFR still apply.
- Alternative Reward Functions: The current reward function focuses on overall exploitability reduction. Exploring more nuanced reward functions that provide denser feedback or target specific convergence behaviors (e.g., stability, avoiding oscillations) could be a fruitful area for future work.
- Exploring Dynamic τ Further: The discrete choices for τ (1, 2, 5, 10, 20) are relatively limited. A continuous τ (with appropriate regularization) or a more sophisticated adaptive mechanism for deciding when to update the weights could yield further improvements, though it might increase the complexity of the policy and training.

Overall, DDCFR represents a significant step towards more autonomous and efficient game-solving AI, demonstrating the power of meta-learning in optimizing foundational algorithms.