◇【论文_20151120_20160405v3】Dueling Network 决斗〔Google DeepMind〕

整理代码:Dueling_DQN_+_Pendulum_v1.ipynb

https://arxiv.org/abs/1511.06581

Dueling Network Architectures for Deep Reinforcement Learning


文章目录

  • 摘要
  • 1. 引言
    • 1.1. 相关工作
  • 2. 背景
    • 2.1. Deep Q-networks 【DQN】
    • 2.2. Double Deep Q-networks 【DDQN】
    • 2.3. Prioritized Replay 【优先回放】
  • 3. Dueling Network 架构
  • 4. 实验
    • 4.1. 策略评估
    • 4.2. 通用 Atari 游戏
      • 人类启动 (human starts) 的稳健性
      • 结合 优先经验回放
      • 显著性图
  • 5. 讨论
  • 6. 结论
  • 参考文献
  • 附录

摘要

In recent years there have been many successes of using deep representations in reinforcement learning. 【 本工作 所属 研究领域】
近年来,在强化学习中使用深度表示已经取得了许多成功。
Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders.
尽管如此,这些应用中仍有许多使用传统架构,例如卷积网络、LSTMs 或自动编码器。
In this paper, we present a new neural network architecture for model-free reinforcement learning. 【在本文中,我们提出了一种新的用于 … 的 … 架构 】
在本文中,我们提出了一种新的用于无模型强化学习的神经网络架构。
Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. 【 关键 idea 】
我们的 dueling network 表示两个独立的估计器:一个用于状态价值函数,另一个用于依赖状态的动作优势函数。
The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. 【 优势 】
这种分解的主要好处是能够在动作之间泛化学习,而无需对底层强化学习算法做任何改动。
Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions.
我们的结果表明,在存在许多 价值相似的动作时,这种架构可以导致更好的策略评估。
Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain. 【 评估结果:SOTA 】
此外,决斗架构使我们的 RL 代理在 Atari 2600 领域的表现优于最先进的技术。

1. 引言

【最近的一些里程碑式进展】

Over the past years, deep learning has contributed to dramatic advances in scalability and performance of machine learning (LeCun et al., 2015).
在过去的几年里,深度学习在机器学习的可扩展性和性能方面取得了巨大的进步(LeCun et al., 2015)。
One exciting application is the sequential decision-making setting of reinforcement learning (RL) and control.
一个令人兴奋的应用是 强化学习(RL)和 控制 的序列决策场景。
Notable examples include deep Q-learning (Mnih et al., 2015), deep visuomotor policies (Levine et al., 2015), attention with recurrent networks (Ba et al., 2015), and model predictive control with embeddings (Watter et al., 2015).
值得注意的例子包括深度 Q-learning、深度视觉运动策略、结合注意力机制的循环网络,以及基于嵌入的模型预测控制。
Other recent successes include massively parallel frameworks (Nair et al., 2015) and expert move prediction in the game of Go (Maddison et al., 2015), which produced policies matching those of Monte Carlo tree search programs, and squarely beaten a professional player when combined with search (Silver et al., 2016).
最近的其他成功还包括大规模并行框架,以及围棋中的专家落子预测,后者产生了与蒙特卡洛树搜索程序相当的策略,并在与搜索结合后完胜职业棋手。

【本文打算针对某个点做什么】

In spite of this, most of the approaches for RL use standard neural networks, such as convolutional networks, MLPs, LSTMs and auto-encoders.
尽管如此,大多数 RL 方法仍使用标准的神经网络,如卷积网络、MLP、LSTM 和自编码器。
The focus in these recent advances has been on designing improved control and RL algorithms, or simply on incorporating existing neural network architectures into RL methods.
这些最新进展的重点是设计改进的控制和强化学习算法,或者简单地将现有的神经网络架构整合到强化学习方法中。
Here, we take an alternative but complementary approach of focusing primarily on innovating a neural network architecture that is better suited for model-free RL.
在这里,我们采取一种替代但互补的方法:主要关注于创新一种更适合无模型强化学习的神经网络架构。
This approach has the benefit that the new network can be easily combined with existing and future algorithms for RL.
这种方法的好处是,新的网络可以很容易地与现有的和未来的强化学习算法相结合。
That is, this paper advances a new network (Figure 1), but uses already published algorithms.
也就是说,本文提出了一个新的网络(图 1),但使用了已经发表的算法。


Figure 1. A popular single stream Q-network (top) and the dueling Q-network (bottom).
图 1 流行的单流 Q-network(上)和 决斗 Q-network(下)。
The dueling network has two streams to separately estimate (scalar) state-value and the advantages for each action; the green output module implements equation (9) to combine them.
Both networks output Q-values for each action.
决斗网络有两个流分别估计(标量)状态价值 和 每个动作的优势;
绿色输出模块实现式 (9) 将二者组合起来。
两个网络都输出每个动作的 Q 值。

【本文主要工作的要点阐述】

The proposed network architecture, which we name the dueling architecture, explicitly separates the representation of state values and (state-dependent) action advantages.
我们将所提出的网络架构称为决斗架构(dueling architecture),它显式地分离了状态价值的表示和(依赖于状态的)动作优势的表示。
The dueling architecture consists of two streams that represent the value and advantage functions, while sharing a common convolutional feature learning module.
决斗架构 由表示 价值函数 和 优势函数 的两个流组成,同时共享一个通用的卷积特征学习模块
The two streams are combined via a special aggregating layer to produce an estimate of the state-action value function Q as shown in Figure 1.
这两个流通过一个特殊的聚合层组合在一起,产生对状态-动作 价值函数 Q 的估计,如图 1 所示。
This dueling network should be understood as a single Q network with two streams that replaces the popular single-stream Q network in existing algorithms such as Deep Q-Networks (DQN; Mnih et al., 2015).
这种 dueling network 应该被理解为具有两个流的单 Q 网络,它取代了现有算法中流行的单流 Q 网络,如深度 Q 网络 (DQN;Mnih et al., 2015)。
The dueling network automatically produces separate estimates of the state value function and advantage function, without any extra supervision.
dueling network 自动产生状态价值函数 和 优势函数的单独估计,不需要任何额外的监督。

Intuitively, the dueling architecture can learn which states are (or are not) valuable, without having to learn the effect of each action for each state.
直观地说,决斗架构可以学习哪些状态有价值(或没有价值),而无需学习每个动作在每个状态下的效果。
This is particularly useful in states where its actions do not affect the environment in any relevant way.
这在那些无论采取什么动作都不会对环境产生实质影响的状态中尤其有用。
To illustrate this, consider the saliency maps shown in Figure 2.
为了说明这一点, 考虑图 2 中所示的显著性映射。
These maps were generated by computing the Jacobians of the trained value and advantage streams with respect to the input video, following the method proposed by Simonyan et al. (2013).
这些映射是根据 Simonyan 等人(2013)提出的方法,通过计算训练 价值 和 优势 流相对于输入视频的雅可比矩阵生成的。
(The experimental section describes this methodology in more detail.)
(实验部分更详细地描述了这种方法。)
The figure shows the value and advantage saliency maps for two different time steps.
该图显示了两个不同时间步的 价值 和 优势 显著映射。
In one time step (leftmost pair of images), we see that the value network stream pays attention to the road and in particular to the horizon, where new cars appear.
在一个时间步(最左边的一对图像)中,我们看到价值网络流关注道路,特别是地平线,新车出现的地方。
It also pays attention to the score.
它也关注分数。
The advantage stream on the other hand does not pay much attention to the visual input because its action choice is practically irrelevant when there are no cars in front.
另一方面,优势流不太关注视觉输入,因为当前面没有汽车时,它的动作选择实际上是无关紧要的。
However, in the second time step (rightmost pair of images) the advantage stream pays attention as there is a car immediately in front, making its choice of action very relevant.
然而,在第二个时间步(最右边的一对图像)中,优势流注意到前面紧接着有一辆车,这使得它的动作选择非常相关。

【实验部分的阐述】

In the experiments, we demonstrate that the dueling architecture can more quickly identify the correct action during policy evaluation as redundant or similar actions are added to the learning problem.
在实验中,我们证明了:当学习问题中加入冗余或相似的动作时,决斗架构能在策略评估期间更快地识别出正确的动作。

We also evaluate the gains brought in by the dueling architecture on the challenging Atari 2600 testbed.
我们还在具有挑战性的 Atari 2600 测试台上评估了决斗架构带来的收益。
Here, an RL agent with the same structure and hyper-parameters must be able to play 57 different games by observing image pixels and game scores only.
在这里,具有相同结构和超参数的强化学习代理必须能够仅通过观察图像像素游戏分数来玩 57 种不同的游戏。
The results illustrate vast improvements over the single-stream baselines of Mnih et al. (2015) and van Hasselt et al. (2015).
结果表明,与 Mnih 等人(2015)和 van Hasselt 等人(2015)的单流基线相比,有了巨大的改进
The combination of prioritized replay (Schaul et al., 2016) with the proposed dueling network results in the new state-of-the-art for this popular domain.
优先回放(Schaul et al., 2016)与 提出的决斗网络相结合,为这一流行领域带来了最新最先进技术。

1.1. 相关工作

The notion of maintaining separate value and advantage functions goes back to Baird (1993).
维持独立的价值 和 优势函数的概念可以追溯到 Baird(1993)。
In Baird’s original advantage updating algorithm, the shared Bellman residual update equation is decomposed into two updates: one for a state value function, and one for its associated advantage function.
在 Baird 的原始优势更新算法中,将共享 Bellman 残差更新公式 分解为两个更新:一个是对状态价值函数的更新,另一个是对其关联的优势函数的更新。
Advantage updating was shown to converge faster than Q-learning in simple continuous time domains in (Harmon et al., 1995).
在简单的连续时间域中,优势更新被证明比 Q-learning 收敛得更快(Harmon et al., 1995)。
Its successor, the advantage learning algorithm, represents only a single advantage function (Harmon & Baird, 1996).
它的后继算法,即优势学习算法,只表示单一的优势函数(Harmon & Baird, 1996)。

The dueling architecture represents both the value V(s) and advantage A(s, a) functions with a single deep model whose output combines the two to produce a state-action value Q(s, a).
决斗架构用单个深度模型同时表示价值函数 $V(s)$ 和优势函数 $A(s,a)$,该模型的输出将二者结合起来,产生状态-动作价值 $Q(s,a)$。
Unlike in advantage updating, the representation and algorithm are decoupled by construction.
与优势更新不同,表示 和 算法通过构造解耦
Consequently, the dueling architecture can be used in combination with a myriad of model free RL algorithms.
因此,决斗架构可以与无数的无模型 RL 算法结合使用。

There is a long history of advantage functions in policy gradients, starting with (Sutton et al., 2000).
优势函数在策略梯度方法中有很长的历史,最早可以追溯到 (Sutton et al., 2000)。
As a recent example of this line of work, Schulman et al. (2015) estimate advantage values online to reduce the variance of policy gradient algorithms.
最近的一个例子是,Schulman 等人(2015)在线估计优势值,以减少策略梯度算法的方差。

There have been several attempts at playing Atari with deep reinforcement learning, including Mnih et al. (2015); Guo et al. (2014); Stadie et al. (2015); Nair et al. (2015); van Hasselt et al. (2015); Bellemare et al. (2016) and Schaul et al. (2016).
已经有多次用深度强化学习玩 Atari 的尝试,包括 Mnih et al. (2015); Guo et al. (2014); Stadie et al. (2015); Nair et al. (2015); van Hasselt et al. (2015); Bellemare et al. (2016) 以及 Schaul et al. (2016)。
The results of Schaul et al. (2016) are the current published state-of-the-art.
Schaul et al.(2016)的结果是目前发表的最先进的。

2. 背景

我们考虑一个序列决策设置:agent 与环境 $\mathcal{E}$ 在离散时间步上交互,参见 Sutton 和 Barto(1998)的介绍。
例如,在 Atari 领域,agent 在时间步 $t$ 感知到由 $M$ 帧图像组成的视频 $s_t = (x_{t-M+1},\cdots,x_t) \in \mathcal{S}$。
然后 agent 从离散集合 $\mathcal{A} = \{1,\cdots,|\mathcal{A}|\}$ 中选择一个动作 $a_t$,并观察游戏模拟器产生的奖励信号 $r_t$。

agent 寻求最大化折扣回报的期望,折扣回报定义为 $R_t= \sum_{\tau=t}^{\infty}\gamma^{\tau-t}r_{\tau}$。
在这个公式中,$\gamma \in [0,1]$ 是折扣因子,用于权衡即时奖励与未来奖励的相对重要性。

对于按照随机策略 $\pi$ 行动的 agent,状态-动作对 $(s, a)$ 和状态 $s$ 的价值定义如下:

$Q^\pi(s,a)={\mathbb E}[R_t \mid s_t=s, a_t=a, \pi]$

$V^\pi(s)={\mathbb E}_{a\sim \pi(s)}[Q^\pi(s,a)] \qquad (1)$

The preceding state-action value function (Q function for short) can be computed recursively with dynamic programming:
上述 状态-动作 价值函数(简称 Q 函数)可以用动态规划递归计算:

$Q^\pi(s,a)={\mathbb E}_{s^\prime}\Big[r+\gamma\, {\mathbb E}_{a^\prime\sim \pi(s^\prime)}[Q^\pi(s^\prime,a^\prime)] \,\Big|\, s,a,\pi\Big]$

定义最优的 $Q^*(s,a) = \max_\pi Q^\pi(s,a)$。
在确定性策略 $a = \arg\max_{a^\prime\in \mathcal{A}} Q^*(s,a^\prime)$ 下,可以得到 $V^*(s) = \max_a Q^*(s, a)$。
由此也可以得到最优 $Q$ 函数满足 Bellman 方程:

$Q^*(s,a)={\mathbb E}_{s^\prime}\Big[r+\gamma \max\limits_{a^\prime} Q^*(s^\prime, a^\prime) \,\Big|\, s,a\Big] \qquad (2)$

We define another important quantity, the advantage function, relating the value and Q functions:
我们定义了另一个重要的量,即优势函数,它将 价值 和 Q 函数联系起来:

$A^\pi(s,a)=Q^\pi(s,a)-V^\pi(s) \qquad (3)$

注意 ${\mathbb E}_{a\sim\pi(s)}[A^\pi(s,a)]=0$。
Intuitively, the value function $V$ measures how good it is to be in a particular state $s$.
直观地说,价值函数 $V$ 衡量的是处于某一特定状态 $s$ 有多好。
The Q function, however, measures the value of choosing a particular action when in this state.
然而,Q 函数衡量的是在该状态下选择某一特定动作的价值。
The advantage function subtracts the value of the state from the Q function to obtain a relative measure of the importance of each action.
优势函数从 Q 函数中减去状态价值,以获得每个动作重要性的相对度量。
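〔补充推导(非原文内容):由式 (1) 与式 (3) 可直接得到 ${\mathbb E}_{a\sim\pi(s)}[A^\pi(s,a)]={\mathbb E}_{a\sim\pi(s)}[Q^\pi(s,a)]-V^\pi(s)=V^\pi(s)-V^\pi(s)=0$,这就是上文"优势在策略分布下期望为零"的由来。〕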

2.1. Deep Q-networks 【DQN】

前面描述的价值函数是高维对象。
为了近似它们,我们可以使用深度 Q 网络 $Q(s, a;\theta)$,其参数为 $\theta$。
为了估计这个网络,我们在第 $i$ 次迭代时优化如下损失函数序列:

$L_i(\theta_i)={\mathbb E}_{s,a,r,s^\prime}\Big[\big(y_i^\text{DQN}-Q(s,a;\theta_i)\big)^2\Big] \qquad (4)$

其中 $y_i^\text{DQN}=r+\gamma \max\limits_{a^\prime}Q(s^\prime,a^\prime;\theta^-) \qquad (5)$

式中 $\theta^-$ 表示固定且独立的目标网络参数。
我们可以尝试使用标准的 Q-learning 来学习在线网络 $Q(s, a;\theta)$ 的参数。
然而,这个估计器在实践中表现不佳。
(Mnih et al., 2015) 的一个关键创新是:在通过梯度下降更新在线网络 $Q(s, a;\theta_i)$ 的同时,将目标网络 $Q(s^\prime, a^\prime;\theta^-)$ 的参数冻结固定的迭代次数,这大大提高了算法的稳定性。
具体的梯度更新为

$\nabla_{\theta_i}L_i(\theta_i)={\mathbb E}_{s,a,r,s^\prime}\Big[\big(y_i^\text{DQN}-Q(s,a;\theta_i)\big)\nabla_{\theta_i}Q(s,a;\theta_i)\Big]$

This approach is model free in the sense that the states and rewards are produced by the environment.
这种方法是无需模型的,因为 状态 和 奖励 是由环境产生的。
It is also off-policy because these states and rewards are obtained with a behavior policy (epsilon greedy in DQN) different from the online policy that is being learned.
它也是 异策略off-policy,因为这些状态 和 奖励 是通过与 正在学习的在线策略 不同的 行为策略(DQN 中的 epsilon greedy )获得的。

正在学习的: 在线策略
交互获得 状态 和 奖励 的 策略: 行为策略

  • 两者不同 ——> off-policy

DQN 成功的另一个关键因素是经验回放experience replay (Lin, 1993;Mnih et al., 2015)。
在学习过程中,agent 将多个回合的经验 $e_t = (s_t, a_t, r_t, s_{t+1})$ 累积到数据集 $\mathcal{D}_t = \{e_1, e_2,\cdots,e_t\}$ 中。
在训练 Q 网络时,不是只使用标准时序差分学习规定的当前经验,而是从 $\mathcal{D}$ 中均匀随机采样小批量经验来训练网络。
损失序列的形式为:

$L_i(\theta_i)={\mathbb E}_{(s,a,r,s^\prime)\sim \mathcal{U}(\mathcal{D})}\Big[\big(y_i^\text{DQN}-Q(s,a;\theta_i)\big)^2\Big]$

Experience replay increases data efficiency through re-use of experience samples in multiple updates and, importantly, it reduces variance as uniform sampling from the replay buffer reduces the correlation among the samples used in the update.
经验回放通过在多个更新中重用经验样本提高数据效率,重要的是,它减少了方差,因为来自 replay buffer 的均匀采样降低了更新中使用的样本之间的相关性。
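下面给出式 (4)(5) 与均匀经验回放的一个最小 PyTorch 示意(非论文原始实现;QNet 的结构、隐藏层宽度等均为演示用的假设):

```python
import random
import torch
import torch.nn as nn

# 示意用的 Q 网络(结构为假设,仅用于演示式 (4)(5) 的计算流程)
class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, s):
        return self.net(s)  # 输出每个动作的 Q 值

def sample(buffer, batch_size=32):
    """从回放缓冲区 D 中均匀随机采样一个小批量(经验回放)。"""
    s, a, r, s2, done = zip(*random.sample(buffer, batch_size))
    return (torch.stack(s), torch.tensor(a), torch.tensor(r),
            torch.stack(s2), torch.tensor(done))

def dqn_loss(online, target, batch, gamma=0.99):
    """按式 (4)(5) 计算 DQN 损失:目标由参数被冻结的目标网络 θ⁻ 给出。"""
    s, a, r, s2, done = batch
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)                 # Q(s,a;θ_i)
    with torch.no_grad():
        y = r + gamma * (1 - done) * target(s2).max(dim=1).values      # 式 (5)
    return nn.functional.mse_loss(q, y)                                # 式 (4)
```

其中 target 网络的参数即 $\theta^-$,训练时每隔固定迭代次数才与 online 网络同步一次。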

2.2. Double Deep Q-networks 【DDQN】

The previous section described the main components of DQN as presented in (Mnih et al., 2015).
前一节描述了(Mnih et al., 2015)中提出的 DQN 的主要成分。
In this paper, we use the improved Double DQN (DDQN) learning algorithm of van Hasselt et al. (2015).
在本文中,我们使用 van Hasselt et al.(2015)改进的 Double DQN (DDQN)学习算法。
In Q-learning and DQN, the max operator uses the same values to both select and evaluate an action.
在 Q-learning 和 DQN 中,最大算子使用相同的值选择和评估一个动作
This can therefore lead to overoptimistic value estimates (van Hasselt, 2010).
因此,这可能导致过于乐观的价值估计(van Hasselt, 2010)。
To mitigate this problem, DDQN uses the following target:
为了缓解这个问题,DDQN 使用以下目标:
~  
$y_i^\text{DDQN}=r+\gamma\, Q\big(s^\prime,\ \arg\max\limits_{a^\prime} Q(s^\prime,a^\prime;\theta_i);\ \theta^-\big) \qquad (6)$

其中内层的 $\arg\max$ 使用在线网络参数 $\theta_i$ 选择动作,外层的 $Q$ 使用目标网络参数 $\theta^-$ 评估该动作。〔多迭代几次、找到更好的动作后,再更新更直接影响目标的权重 $\theta^-$。〕

DDQN 与 DQN 的训练流程相同(见 Mnih et al. (2015)),只是把 DQN 的目标 $y_i^\text{DQN}$ 替换为 $y_i^\text{DDQN}$。
DDQN 的伪代码见附录 A
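式 (6) 只改变目标的计算方式,其余与 DQN 相同。一个最小示意如下(沿用上面 batch 的约定,均为假设):

```python
def ddqn_target(online, target, batch, gamma=0.99):
    """式 (6):用在线网络(θ_i)选动作,用目标网络(θ⁻)评估该动作。"""
    s, a, r, s2, done = batch
    with torch.no_grad():
        a_star = online(s2).argmax(dim=1, keepdim=True)    # arg max_a' Q(s',a';θ_i)
        q_next = target(s2).gather(1, a_star).squeeze(1)   # Q(s', a*; θ⁻)
        return r + gamma * (1 - done) * q_next
```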

2.3. Prioritized Replay 【优先回放】

A recent innovation in prioritized experience replay (Schaul et al., 2016) built on top of DDQN and further improved the state-of-the-art.
最近在优先经验回放方面的创新(Schaul et al., 2016)建立在 DDQN 之上,进一步提高了技术水平。
Their key idea was to increase the replay probability of experience tuples that have a high expected learning progress (as measured via the proxy of absolute TD-error).
他们的主要想法是提高那些预期学习进展较大的经验元组的回放概率(以绝对 TD 误差作为代理指标来衡量)。
This led to both faster learning and to better final policy quality across most games of the Atari benchmark suite, as compared to uniform experience replay.
与均匀的经验回放相比,这在 Atari 基准套件的大多数游戏中都能带来更快的学习更好的最终策略质量

To strengthen the claim that our dueling architecture is complementary to algorithmic innovations, we show that it improves performance for both the uniform and the prioritized replay baselines (for which we picked the easier to implement rank-based variant), with the resulting prioritized dueling variant holding the new state-of-the-art.
为了支持决斗架构与算法层面的创新相互补充这一论断,我们表明它同时提升了均匀回放基线和优先回放基线(我们选择了更易实现的基于排名的变体)的性能,并且由此得到的优先决斗变体成为新的最先进结果。

3. Dueling Network 架构

The key insight behind our new architecture, as illustrated in Figure 2, is that for many states, it is unnecessary to estimate the value of each action choice.
如图 2 所示,我们的新架构背后的关键见解是,对于许多状态,没有必要估计每个动作选择的价值
For example, in the Enduro game setting, knowing whether to move left or right only matters when a collision is imminent.
例如,在 Enduro 游戏中,只有当碰撞迫在眉睫时,知道该向左还是向右移动才重要。
In some states, it is of paramount importance to know which action to take, but in many other states the choice of action has no repercussion on what happens.
在一些状态,知道执行哪个动作是至关重要的,但在许多其他状态,动作的选择对发生的事情没有影响。
For bootstrapping based algorithms, however, the estimation of state values is of great importance for every state.
然而,对于基于自举的算法,状态价值的估计对于每个状态都是非常重要的。

To bring this insight to fruition, we design a single Q-network architecture, as illustrated in Figure 1, which we refer to as the dueling network.
为了实现这一见解,我们设计了一个单一的 Q 网络架构,如图 1 所示,我们将其称为 dueling network。
The lower layers of the dueling network are convolutional as in the original DQNs (Mnih et al., 2015).
dueling network 的下层与原始 DQNs 一样是卷积的(Mnih et al., 2015)。
However, instead of following the convolutional layers with a single sequence of fully connected layers, we instead use two sequences (or streams) of fully connected layers.
然而,我们没有在卷积层之后接单个全连接层序列,而是使用两个全连接层序列(即两条流)。
The streams are constructed such that they have the capability of providing separate estimates of the value and advantage functions.
这两条流的构造使它们能够分别给出价值函数和优势函数的估计。
Finally, the two streams are combined to produce a single output Q function.
最后,将这两个流组合在一起以产生单个输出 Q 函数。
As in (Mnih et al., 2015), the output of the network is a set of Q values, one for each action.
如( Mnih 等人,2015 )所述,网络的输出是一组 Q 值,每个动作对应一个 Q 值

Since the output of the dueling network is a Q function, it can be trained with the many existing algorithms, such as DDQN and SARSA.
由于 dueling network 的输出是一个 Q 函数,因此可以使用许多现有的算法进行训练,例如 DDQN 和 SARSA。
In addition, it can take advantage of any improvements to these algorithms, including better replay memories, better exploration policies, intrinsic motivation, and so on.
此外,它还可以利用这些算法的任何改进,包括更好的 replay memories 、更好的探索策略、内在动机等等。

The module that combines the two streams of fully-connected layers to output a Q estimate requires very thoughtful design.
将两条全连接流组合起来输出 Q 估计的模块需要非常精心的设计。

由 $Q^\pi(s,a) = V^\pi(s) + A^\pi(s,a)$ 与 $V^\pi(s) = {\mathbb E}_{a\sim\pi(s)}[Q^\pi(s,a)]$ 可知 ${\mathbb E}_{a\sim\pi(s)}[A^\pi(s,a)] = 0$。
此外,对于确定性策略 $a^* = \arg\max_{a^\prime\in \mathcal{A}} Q(s, a^\prime)$,有 $Q(s, a^*) = V(s)$,因此 $A(s, a^*) = 0$。

让我们考虑图 1 所示的 dueling network:其中一条全连接流输出标量 $V(s;\theta, \beta)$,另一条流输出 $|\mathcal{A}|$ 维向量 $A(s,a;\theta,\alpha)$。
这里 $\theta$ 表示卷积层的参数,而 $\alpha$ 和 $\beta$ 分别是两条全连接流的参数。

Using the definition of advantage, we might be tempted to construct the aggregating module as follows:
利用优势的定义,我们可能会倾向于把聚合模块构造成如下形式:

$Q(s,a;\theta,\alpha,\beta)=V(s;\theta, \beta)+A(s,a;\theta,\alpha) \qquad (7)$

注意,这个表达式适用于所有 $(s, a)$ 实例;
也就是说,若要把式 (7) 写成矩阵形式,需要把标量 $V(s;\theta, \beta)$ 复制 $|\mathcal{A}|$ 次。

然而,我们需要记住 $Q(s,a;\theta,\alpha,\beta)$ 只是真实 Q 函数的参数化估计。
此外,据此断言 $V(s;\theta, \beta)$ 是状态价值函数的良好估计量是错误的;同样,断言 $A(s,a;\theta,\alpha)$ 给出了优势函数的合理估计也是错误的。

式 (7) 是不可辨识的,因为给定 $Q$,我们无法唯一地恢复 $V$ 和 $A$。
要看出这一点,只需给 $V(s;\theta, \beta)$ 加上一个常数,并从 $A(s,a;\theta,\alpha)$ 中减去相同的常数。
该常数相互抵消,得到相同的 $Q$ 值。
直接使用式 (7) 时,这种可辨识性的缺乏表现为较差的实际性能。

To address this issue of identifiability, we can force the advantage function estimator to have zero advantage at the chosen action.
为了解决这个可辨识性问题,我们可以强制优势函数估计器在所选动作处的优势为零。
That is, we let the last module of the network implement the forward mapping
也就是说,我们让网络的最后一个模块实现正向映射

$Q(s,a;\theta,\alpha,\beta)=V(s;\theta, \beta)+\Big(A(s,a;\theta,\alpha)-\max\limits_{a^\prime\in\mathcal{A}}A(s,a^\prime;\theta,\alpha)\Big) \qquad (8)$

现在,对于 $a^*=\arg\max_{a^\prime\in \mathcal{A}}Q(s, a^\prime;\theta, \alpha, \beta)=\arg\max_{a^\prime\in \mathcal{A}}A(s, a^\prime;\theta, \alpha)$,我们得到 $Q(s, a^*;\theta, \alpha, \beta) = V(s;\theta, \beta)$。
因此,流 $V(s;\theta, \beta)$ 提供了对价值函数的估计,而另一条流则给出对优势函数的估计。

An alternative module replaces the max operator with an average:
一个替代模块将 max 操作符替换为平均值:

$Q(s,a;\theta,\alpha,\beta)=V(s;\theta, \beta)+\Big(A(s,a;\theta,\alpha)-\frac{1}{|\mathcal{A}|}\sum\limits_{a^\prime}A(s,a^\prime;\theta,\alpha)\Big) \qquad (9)$

On the one hand this loses the original semantics of V V V and A A A because they are now off-target by a constant, but on the other hand it increases the stability of the optimization: with (9) the advantages only need to change as fast as the mean, instead of having to compensate any change to the optimal action’s advantage in (8).
一方面,这失去了 V V V A A A 的原始语义,因为它们现在偏离了一个常数,但另一方面,它增加了优化的稳定性:在 (9) 中,优势只需要与平均值一样快地变化,而不必补偿 (8) 中最优动作的优势的任何变化。
We also experimented with a softmax version of equation (8), but found it to deliver similar results to the simpler module of equation (9).
我们还用式 (8) 的 softmax 版本进行了实验,但发现它与式 (9) 的简单模块提供了类似的结果。
Hence, all the experiments reported in this paper use the module of equation (9).
因此,本文报道的所有实验均采用式 (9) 的模块。

Note that while subtracting the mean in equation (9) helps with identifiability, it does not change the relative rank of the A (and hence Q) values, preserving any greedy or ϵ \epsilon ϵ-greedy policy based on Q values from equation (7).
请注意,虽然式 (9) 中减去均值有助于可辨识性,但它并不改变 $A$(因而也不改变 $Q$)值的相对排序,从而保留了基于式 (7) 的 Q 值的任何贪心或 $\epsilon$-贪心策略。
When acting, it suffices to evaluate the advantage stream to make decisions.
在行动时,评估优势流就足以做出决定

It is important to note that equation (9) is viewed and implemented as part of the network and not as a separate algorithmic step.
值得注意的是,式 (9) 被视为网络的一部分,而不是单独的算法步骤。
Training of the dueling architectures, as with standard Q networks (e.g. the deep Q-network of Mnih et al. (2015)), requires only back-propagation.
决斗架构的训练,如标准 Q 网络(例如 Mnih 等人(2015)的深度 Q 网络),只需要反向传播。
The estimates $V(s; \theta, \beta)$ and $A(s, a; \theta, \alpha)$ are computed automatically without any extra supervision or algorithmic modifications.
估计量 $V(s;\theta, \beta)$ 和 $A(s, a;\theta, \alpha)$ 是自动计算得到的,不需要任何额外的监督或算法修改。

As the dueling architecture shares the same input-output interface with standard Q networks, we can recycle all learning algorithms with Q-networks (e.g., DDQN and SARSA) to train the dueling architecture.
由于决斗架构与标准 Q 网络共享相同的输入输出接口,我们可以复用所有针对 Q 网络的学习算法(例如 DDQN 和 SARSA)来训练决斗架构。
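下面是式 (9) 聚合模块的一个 PyTorch 示意(流的宽度 512 取自后文 Atari 实验的描述,其余细节为演示用的假设,并非论文的原始实现):

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """把共享特征分成价值流 V(s;θ,β) 与优势流 A(s,a;θ,α),再按式 (9) 聚合为 Q。"""
    def __init__(self, feat_dim, n_actions, hidden=512):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))           # 标量 V
        self.adv = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))     # |A| 维 A

    def forward(self, feat):
        v = self.value(feat)          # [batch, 1],广播相当于把标量 V 复制 |A| 次
        a = self.adv(feat)            # [batch, |A|]
        return v + a - a.mean(dim=1, keepdim=True)    # 式 (9):减去优势的均值
```

若想改用式 (8),只需把 `a.mean(dim=1, keepdim=True)` 换成 `a.max(dim=1, keepdim=True).values`。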

4. 实验

We now show the practical performance of the dueling network.
现在我们展示 dueling network 的实际性能。
We start with a simple policy evaluation task and then show larger scale results for learning policies for general Atari game-playing.
我们从一个简单的策略评估任务开始,然后展示了学习一般 Atari 游戏策略的更大规模结果。

4.1. 策略评估

We start by measuring the performance of the dueling architecture on a policy evaluation task.
我们从度量 决斗架构 在策略评估任务上的性能开始。
We choose this particular task because it is very useful for evaluating network architectures, as it is devoid of confounding factors such as the choice of exploration strategy, and the interaction between policy improvement and policy evaluation.
我们选择这个特定的任务是因为它对于评估网络架构非常有用,因为它没有混淆因素,例如探索策略的选择,以及策略改进和策略评估之间的相互作用。

在这个实验中,我们使用时序差分学习(不使用资格迹 eligibility traces,即 $\lambda = 0$)来学习 Q 值。
更具体地说,给定行为策略 $\pi$,我们通过优化式 (4) 形式的损失序列来估计状态-动作价值 $Q^\pi(\cdot,\cdot)$,其目标为

$y_i=r+\gamma\, {\mathbb E}_{a^\prime\sim\pi(s^\prime)}[Q(s^\prime,a^\prime;\theta_i)]$

The above update rule is the same as that of Expected SARSA (van Seijen et al., 2009).
上述更新规则与 Expected SARSA 相同(van Seijen et al., 2009)。
We, however, do not modify the behavior policy as in Expected SARSA.
但是,我们不会像 Expected SARSA 那样修改行为策略。
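在 $\epsilon$-greedy 行为策略下,上式中对 $a^\prime\sim\pi(s^\prime)$ 的期望可以显式算出。下面是一个示意($\epsilon=0.001$ 取自后文的实验设置,网络接口沿用前面的假设):

```python
import torch
import torch.nn as nn

def expected_sarsa_target(q_net, r, s2, gamma=0.99, eps=0.001):
    """策略评估目标 y_i = r + γ E_{a'~π(s')}[Q(s',a';θ_i)],其中 π 为 ε-greedy。"""
    with torch.no_grad():
        q2 = q_net(s2)                                          # [batch, |A|]
        n_actions = q2.shape[1]
        greedy = nn.functional.one_hot(q2.argmax(dim=1), n_actions).float()
        pi = eps / n_actions + (1.0 - eps) * greedy             # ε-greedy 的动作分布
        return r + gamma * (pi * q2).sum(dim=1)
```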

To evaluate the learned Q values, we choose a simple environment where the exact $Q^\pi(s, a)$ values can be computed separately for all $(s, a)\in \mathcal{S} \times \mathcal{A}$.
为了评估习得的 Q 值,我们选择一个简单的环境,在其中可以对所有 $(s, a)\in \mathcal{S} \times \mathcal{A}$ 单独计算精确的 $Q^\pi(s, a)$ 值。
This environment, which we call the corridor is composed of three connected corridors.
这个环境,我们称之为走廊,是由三条相连的走廊组成的。
A schematic drawing of the corridor environment is shown in Figure 3. The agent starts from the bottom left corner of the environment and must move to the top right to get the largest reward.
走廊环境的示意图如图 3 所示,agent 从环境的左下角开始,必须移动到右上角才能获得最大的奖励。
A total of 5 actions are available: go up, down, left, right and no-op.
总共有 5 个动作可用:上,下,左,右 和 无操作。
We also have the freedom of adding an arbitrary number of no-op actions. In our setup, the two vertical sections both have 10 states while the horizontal section has 50.
我们还可以自由地添加任意数量的无操作动作。
在我们的设置中,两个垂直部分都有 10 个状态,而水平部分有 50 个状态。


Figure 3.
图 3
(a) The corridor environment.
走廊环境。
The star marks the starting state.
星形 标志 起始状态。
The redness of a state signifies the reward the agent receives upon arrival.
状态的红色表示代理到达时收到的奖励。
The game terminates upon reaching either reward state.
游戏在达到任何一种奖励状态时终止。
The agent’s actions are going up, down, left, right and no action.
agent 的动作为向上,向下,向左,向右,没有动作。
Plots (b), (c) and (d) shows squared error for policy evaluation with 5, 10, and 20 actions on a log-log scale.
图 (b)、(c) 和 (d) 显示了在对数-对数尺度上对 5、10 和 20 个动作进行策略评估的平方误差。
The dueling network (Duel) consistently outperforms a conventional single-stream network (Single), with the performance gap increasing with the number of actions.
决斗网络(Duel)始终优于传统的单流网络(Single),并且性能差距随动作数量的增加而增大。

我们使用 $\epsilon$-greedy 策略作为行为策略 $\pi$:它以概率 $\epsilon$ 选择随机动作,以概率 $1-\epsilon$ 选择由最优 Q 函数给出的动作 $\arg\max_{a\in\mathcal{A}}Q^*(s, a)$。
在我们的实验中,$\epsilon$ 取 0.001。

我们将单流 Q 架构 与 决斗架构在走廊环境的 3 个变体中进行比较,分别有 5 个、10 个和 20 个动作。
10 和 20 个动作变体是通过在原始环境中添加无操作而形成的。
我们用相对于真实价值的平方误差(SE)来衡量性能:$\sum_{s\in\mathcal{S},a\in\mathcal{A}}\big(Q(s, a;\theta) - Q^\pi(s, a)\big)^2$。
单流架构是一个三层 MLP,每个隐藏层有 50 个单元。
决斗架构同样由三层组成:
在第一个含 50 个单元的隐藏层之后,网络分成两条流,每条流都是含 25 个隐藏单元的两层 MLP。
对比结果总结在图 3 中。

The results show that with 5 actions, both architectures converge at about the same speed.
结果表明,在 5 个动作下,两种架构的收敛速度大致相同
However, when we increase the number of actions, the dueling architecture performs better than the traditional Q-network.
然而,当我们增加动作的数量时,决斗架构比传统的 Q 网络表现得更好
In the dueling network, the stream $V(s;\theta,\beta)$ learns a general value that is shared across many similar actions at $s$, hence leading to faster convergence.
在决斗网络中,流 $V(s;\theta,\beta)$ 学到一个在状态 $s$ 处被许多相似动作共享的一般性价值,从而带来更快的收敛。
This is a very promising result because many control tasks with large action spaces have this property, and consequently we should expect that the dueling network will often lead to much faster convergence than a traditional single stream network.
这是一个非常有前景的结果,因为许多具有大动作空间的控制任务都具有此属性,因此我们应该期望决斗网络通常会比传统的单流网络更快地收敛。
In the following section, we will indeed see that the dueling network results in substantial gains in performance in a wide-range of Atari games.
在下一节中,我们将看到决斗网络在各种 Atari 游戏中带来了实质性的性能提升。

4.2. 通用 Atari 游戏

We perform a comprehensive evaluation of our proposed method on the Arcade Learning Environment (Bellemare et al., 2013), which is composed of 57 Atari games.
我们在 Arcade Learning Environment (Bellemare et al., 2013)上对我们提出的方法进行了全面评估,该环境由 57 款 Atari 游戏组成。
The challenge is to deploy a single algorithm and architecture, with a fixed set of hyper-parameters, to learn to play all the games given only raw pixel observations and game rewards.
挑战是部署一个单一的算法和架构,使用一组固定的超参数,在只给出原始像素观察游戏奖励的情况下学会玩所有的游戏。
This environment is very demanding because it is both comprised of a large number of highly diverse games and the observations are high-dimensional.
这个环境要求很高,因为它既包含大量高度多样化的游戏,又包含高维的观察

We follow closely the setup of van Hasselt et al. (2015) and compare to their results using single-stream Q-networks.
我们严格遵循 van Hasselt 等人(2015)的设置,并与他们使用单流 Q 网络得到的结果进行比较。
We train the dueling network with the DDQN algorithm as presented in Appendix A.
我们使用如附录 A 所示的 DDQN 算法训练决斗网络。
At the end of this section, we incorporate prioritized experience replay (Schaul et al., 2016).
在本节的最后,我们结合了优先经验回放(Schaul et al., 2016)。

Our network architecture has the same low-level convolutional structure of DQN (Mnih et al., 2015; van Hasselt et al., 2015).
我们的网络架构具有与 DQN 相同的低阶卷积结构(Mnih et al., 2015;van Hasselt et al., 2015)。
There are 3 convolutional layers followed by 2 fully-connected layers.
The first convolutional layer has 32 8×8 filters with stride 4, the second 64 4×4 filters with stride 2, and the third and final convolutional layer consists of 64 3×3 filters with stride 1.
有 3 个卷积层,后面跟着 2 个全连接层。
第一个卷积层有 32 个 8 × 8 滤波器,步幅为 4,
第二个卷积层有 64 个 4× 4 滤波器,步幅为 2,
第三个也是最后一个卷积层有 64 个 3 × 3 滤波器,步幅为 1。〔 这样每层获得的 feature map 不是一样吗,不一样的话是什么造成的? 〕
As shown in Figure 1, the dueling network splits into two streams of fully connected layers.
如图 1 所示,决斗网络分成两个全连接层的流。
The value and advantage streams both have a fully-connected layer with 512 units.
价值流 和 优势流都有一个全连接层,有 512 个单元。
The final hidden layers of the value and advantage streams are both fully-connected with the value stream having one output and the advantage as many outputs as there are valid actions2.
价值流和优势流的最后隐藏层都是全连接的:价值流有一个输出,而优势流的输出数量等于有效动作的数量(见脚注 2)。

  • 脚注 2:The number of actions ranges between 3-18 actions in the ALE environment.
    在 ALE 环境中,动作的数量在 3-18 之间。

We combine the value and advantage streams using the module described by Equation (9).
我们使用 式 (9) 所描述的模块将价值流 和 优势流结合起来。


Rectifier non-linearities (Fukushima, 1980) are inserted between all adjacent layers.
在所有相邻层之间插入整流非线性单元(Fukushima, 1980)〔即 ReLU〕。
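按照上面的描述(3 个卷积层,其后是两条各含 512 个单元的全连接流,用式 (9) 聚合),一个示意性的 Atari 决斗网络可以写成如下形式;输入按 DQN 常用的 4 帧 84×84 灰度图假设,DuelingHead 沿用第 3 节后给出的示意类:

```python
import torch.nn as nn

class DuelingAtariNet(nn.Module):
    def __init__(self, n_actions, in_frames=4):
        super().__init__()
        self.conv = nn.Sequential(            # 与 DQN 相同的低层卷积结构
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten())
        # 84×84 的输入经上述三层卷积后得到 64 个 7×7 的特征图
        self.head = DuelingHead(64 * 7 * 7, n_actions)

    def forward(self, x):
        return self.head(self.conv(x))        # 输出每个动作的 Q 值
```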

We adopt the optimizers and hyper-parameters of van Hasselt et al. (2015), with the exception of the learning rate which we chose to be slightly lower (we do not do this for double DQN as it can deteriorate its performance).
我们采用 van Hasselt et al.(2015)的优化器和超参数,除了我们选择稍低的学习率(我们不会对 double DQN 这样做,因为会降低其性能)。
Since both the advantage and the value stream propagate gradients to the last convolutional layer in the backward pass, we rescale the combined gradient entering the last convolutional layer by $1/\sqrt{2}$.
由于优势流和价值流在反向传播中都会把梯度传到最后一个卷积层,我们将进入最后一个卷积层的组合梯度缩放 $1/\sqrt{2}$ 倍。
This simple heuristic mildly increases stability.
这个简单的启发式略微增加了稳定性
In addition, we clip the gradients to have their norm less than or equal to 10.
此外,我们裁剪梯度,使其范数小于或等于 10。
This clipping is not standard practice in deep RL, but common in recurrent network training (Bengio et al., 2013).
这种裁剪在深度强化学习中不是标准做法,但在循环网络训练中很常见(Bengio et al., 2013)。
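这两处技巧的一种可能实现如下(示意:用梯度钩子对进入卷积部分的组合梯度做 $1/\sqrt{2}$ 缩放,再做范数裁剪;具体写法是我们的假设,论文并未给出代码,这里沿用上面 DuelingAtariNet 的 conv / head 划分):

```python
import math
import torch.nn as nn

def forward_with_rescale(net, x):
    """前向时给卷积特征挂梯度钩子:反向传播时组合梯度先乘 1/√2 再进入卷积层。"""
    feat = net.conv(x)
    if feat.requires_grad:
        feat.register_hook(lambda g: g / math.sqrt(2.0))
    return net.head(feat)

def clipped_step(optimizer, net, loss, max_norm=10.0):
    """反向传播后,把梯度范数裁剪到不超过 10,再更新参数。"""
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(net.parameters(), max_norm)
    optimizer.step()
```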

To isolate the contributions of the dueling architecture, we re-train DDQN with a single stream network using exactly the same procedure as described above.
为了隔离决斗架构的贡献,我们使用与上面描述的完全相同的过程,使用单个流网络重新训练 DDQN。
Specifically, we apply gradient clipping, and use 1024 hidden units for the first fully-connected layer of the network so that both architectures (dueling and single) have roughly the same number of parameters.
具体来说,我们应用梯度裁剪,并为网络的第一个全连接层使用 1024 个隐藏单元,以便两种架构(决斗 和 单个)具有大致相同数量的参数。
We refer to this re-trained model as Single Clip, while the original trained model of van Hasselt et al. (2015) is referred to as Single.
我们将这个重新训练的模型称为 Single Clip,而 van Hasselt et al.(2015)的原始训练模型称为 Single。

As in (van Hasselt et al., 2015), we start the game with up to 30 no-op actions to provide random starting positions for the agent.
与 (van Hasselt et al., 2015) 一样,我们以至多 30 个无操作动作开始游戏,为 agent 提供随机的起始位置。
To evaluate our approach, we measure improvement in percentage (positive or negative) in score over the better of human and baseline agent scores:
为了评估我们的方法,我们以人类得分和基线 agent 得分中较高者为参照,度量得分提升的百分比(可正可负):

$\dfrac{\text{Score}_\text{Agent}-\text{Score}_\text{Baseline}}{\max\{\text{Score}_\text{Human},\text{Score}_\text{Baseline}\}-\text{Score}_\text{Random}} \qquad (10)$

We took the maximum over human and baseline agent scores as it prevents insignificant changes to appear as large improvements when neither the agent in question nor the baseline are doing well.
我们取人类得分与基线 agent 得分中的较大者,这样当被评估的 agent 和基线都表现不佳时,可以避免微小的变化被呈现为巨大的改进。
For example, an agent that achieves 2% human performance should not be interpreted as two times better when the baseline agent achieves 1% human performance.
例如,当基线代理达到 1% 的人类性能时,不应该将达到 2% 人类性能的代理解释为两倍的性能。
We also chose not to measure performance in terms of percentage of human performance alone because a tiny difference relative to the baseline on some games can translate into hundreds of percent in human performance difference.
我们也选择不单独用人类表现的百分比来衡量性能,因为在某些游戏中,相对基线的微小差异可能会转化为数百个百分点的人类表现差异。
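式 (10) 的度量可以直接写成下面的小函数(示意):

```python
def normalized_improvement(agent, baseline, human, random_score):
    """式 (10):以 max(人类, 基线) 为参照的得分提升百分比(可为负)。"""
    return 100.0 * (agent - baseline) / (max(human, baseline) - random_score)

# 示例:若人类=100、随机=0、基线=1、agent=2,则提升约为 1 个百分点,而不是"好两倍"
```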

The results for the wide suite of 57 games are summarized in Table 1.
表 1 总结了 57 个游戏的结果。
Detailed results are presented in the Appendix.
详细结果见附录。

Using this 30 no-ops performance measure, it is clear that the dueling network (Duel Clip) does substantially better than the Single Clip network of similar capacity.
使用这 30 个无操作的性能度量,很明显,决斗网络 (Duel Clip) 比类似容量的 Single Clip 网络要好得多。
It also does considerably better than the baseline (Single) of van Hasselt et al. (2015).
它也比 van Hasselt et al.(2015)的基线 (Single) 要好得多。
For comparison we also show results for the deep Q-network of Mnih et al. (2015), referred to as Nature DQN.
为了比较,我们还展示了 Mnih et al. (2015) 的深度 Q -网络的结果,称为 Nature DQN.

Figure 4 shows the improvement of the dueling network over the baseline Single network of van Hasselt et al. (2015).
图 4 显示了决斗网络相对于 van Hasselt et al.(2015)的基线 Single 网络的改进。
Again, we see that the improvements are often very dramatic.
再一次,我们看到改善是非常显著的。


Figure 4. Improvements of dueling architecture over the baseline Single network of van Hasselt et al. (2015), using the metric described in Equation (10).
图 4。van Hasselt 等人(2015)的基线 Single 网络上决斗架构的改进,使用式(10)中描述的度量。
Bars to the right indicate by how much the dueling network outperforms the single-stream network.
右边的条形图表示决斗网络比单流网络性能好多少。

As shown in Table 1, Single Clip performs better than Single.
We verified that this gain was mostly brought in by gradient clipping.
如表 1 所示,Single Clip 性能优于 Single。
我们验证了这个增益主要是由梯度裁剪带来的。
For this reason, we incorporate gradient clipping in all the new approaches.
出于这个原因,我们在所有的新方法中都加入了梯度裁剪。

Table 1. Mean and median scores across all 57 Atari games, measured in percentages of human performance.
表 1 所有 57 款 Atari 游戏的平均和中位数得分,以人类表现的百分比衡量。


Duel Clip does better than Single Clip on 75.4% of the games (43 out of 57).
Duel Clip 在 75.4% 的游戏中优于 Single Clip(57 个游戏中有 43 个)。
It also achieves higher scores compared to the Single baseline on 80.7% (46 out of 57) of the games.
它在 80.7% 的比赛中(57 场比赛中的 46 场)取得了比 Single 基线更高的分数。
Of all the games with 18 actions, Duel Clip is better 86.6% of the time (26 out of 30).
在所有具有 18 个动作的游戏中,Duel Clip 在 86.6% 的情况下更优(30 款中的 26 款)。
This is consistent with the findings of the previous section.
这与前一节的结论一致。
Overall, our agent (Duel Clip) achieves human level performance on 42 out of 57 games.
总的来说,我们的 agent (Duel Clip) 在 57 场比赛中有 42 场达到了人类水平。
Raw scores for all the games, as well as measurements in human performance percentage, are presented in the Appendix.
所有游戏的原始分数,以及人类表现百分比的测量值都在附录中。

人类启动 (human starts) 的稳健性

Robustness to human starts.
One shortcoming of the 30 no-ops metric is that an agent does not necessarily have to generalize well to play the Atari games.
30 个无操作指标的一个缺点是,代理不一定要很好地泛化才能玩 Atari 游戏
Due to the deterministic nature of the Atari environment, from an unique starting point, an agent could learn to achieve good performance by simply remembering sequences of actions.
由于 Atari 环境的确定性,从唯一的起点出发,agent 只需记住动作序列就能学会获得良好的表现。

To obtain a more robust measure, we adopt the methodology of Nair et al. (2015).
为了获得更稳健的度量,我们采用了 Nair 等人(2015)的方法。
Specifically, for each game, we use 100 starting points sampled from a human expert’s trajectory.
具体来说,对于每个游戏,我们从人类专家的轨迹中采样 100 个起点
From each of these points, an evaluation episode is launched for up to 108,000 frames.
从每一个点开始,一个评估回合被启动,最多 108,000 帧。
The agents are evaluated only on rewards accrued after the starting point.
仅根据起始点后累积的奖励对代理进行评估。
We refer to this metric as Human Starts.
我们把这个指标称为 人类启动。

As shown in Table 1, under the Human Starts metric, Duel Clip once again outperforms the single stream variants.
如表 1 所示,在 Human Starts 指标下,Duel Clip 再次优于单流变体。
In particular, our agent does better than the Single baseline on 70.2% (40 out of 57) games and on games of 18 actions,Duel Clip is 83.3% better (25 out of 30).
特别地,我们的 agent 在 70.2%(57 款中的 40 款)的游戏上优于 Single 基线;在具有 18 个动作的游戏上,这一比例为 83.3%(30 款中的 25 款)。

结合 优先经验回放

Combining with Prioritized Experience Replay.
The dueling architecture can be easily combined with other algorithmic improvements.
决斗架构可以很容易地与其他算法改进相结合。
In particular, prioritization of the experience replay has been shown to significantly improve performance of Atari games (Schaul et al., 2016).
特别地,经验回放的优先级 已被证明可以显著提高 Atari 游戏的性能(Schaul等人,2016)。
Furthermore, as prioritization and the dueling architecture address very different aspects of the learning process, their combination is promising.
此外,由于优先级 和 决斗架构 处理 学习过程中非常不同的方面,它们的结合是有希望的。
So in our final experiment, we investigate the integration of the dueling architecture with prioritized experience replay.
所以在我们最后的实验中,我们将研究决斗架构 与 优先经验回放的整合。
We use the prioritized variant of DDQN (Prior. Single) as the new baseline algorithm, which replaces the uniform sampling of the experience tuples with rank-based prioritized sampling.
我们使用 DDQN 的优先级变体(Prior. Single)作为新的基线算法,它用基于排名的优先采样取代了经验元组的均匀采样。
We keep all the parameters of the prioritized replay as described in (Schaul et al., 2016), namely a priority exponent of 0.7, and an annealing schedule on the importance sampling exponent from 0.5 to 1.
我们保留(Schaul et al., 2016)中描述的优先回放的所有参数,即优先级指数为 0.7,以及重要性采样指数从 0.5 到 1 的退火方案。
We combine this baseline with our dueling architecture (as above), and again use gradient clipping (Prior. Duel Clip).
我们将此基线 与 决斗架构(如上所述)结合起来,并再次使用梯度裁剪 (Prior. Duel Clip)。
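基于排名的优先采样可以按 Schaul et al. (2016) 的描述粗略勾勒如下($P(i)\propto (1/\text{rank}_i)^{\alpha}$、$\alpha=0.7$、重要性采样指数 $\beta$ 从 0.5 退火到 1;分段近似等实现细节从略,属于示意而非原始实现):

```python
import numpy as np

def rank_based_sample(td_errors, batch_size=32, alpha=0.7, beta=0.5):
    """|TD 误差| 的排名越靠前,被回放的概率越高;同时返回重要性采样权重。"""
    n = len(td_errors)
    ranks = np.empty(n, dtype=np.int64)
    ranks[np.argsort(-np.abs(td_errors))] = np.arange(1, n + 1)   # rank 1 = 最大误差
    p = (1.0 / ranks) ** alpha
    p /= p.sum()
    idx = np.random.choice(n, size=batch_size, p=p)
    w = (n * p[idx]) ** (-beta)        # 重要性采样权重,用于校正非均匀采样带来的偏差
    return idx, w / w.max()            # 按惯例归一化到最大值为 1
```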

Note that, although orthogonal in their objectives, these extensions (prioritization, dueling and gradient clipping) interact in subtle ways.
注意,尽管它们的目标是正交的,但这些扩展(优先级、决斗和梯度裁剪)以微妙的方式相互作用。
For example, prioritization interacts with gradient clipping, as sampling transitions with high absolute TD-errors more often leads to gradients with higher norms.
例如,优先级与梯度裁剪相互作用:更频繁地采样绝对 TD 误差较大的转移,往往会产生范数更大的梯度。
To avoid adverse interactions, we roughly re-tuned the learning rate and the gradient clipping norm on a subset of 9 games.
为了避免不利的相互作用,我们在 9 个游戏的子集上粗略地重新调整了学习率和梯度裁剪范数。
As a result of rough tuning, we settled on 6.25 × 10⁻⁵ for the learning rate and 10 for the gradient clipping norm (the same as in the previous section).
经过粗略调参,我们将学习率定为 $6.25 \times 10^{-5}$,梯度裁剪范数定为 10(与前一节相同)。

When evaluated on all 57 Atari games, our prioritized dueling agent performs significantly better than both the prioritized baseline agent and the dueling agent alone.
在所有 57 款 Atari 游戏上进行评估时,我们的优先级决斗 agent 的表现明显优于单独的 优先级基线代理 和 单独的决斗代理。
The full mean and median performance against the human performance percentage is shown in Table 1.
完整的平均和中位数性能与人类性能百分比 见表 1。
When initializing the games using up to 30 no-ops action, we observe mean and median scores of 591% and 172% respectively.
当使用多达 30 个无操作动作初始化游戏时,我们观察到平均分数为 591%,中位数分数为 172%。
The direct comparison between the prioritized baseline and prioritized dueling versions, using the metric described in Equation 10, is presented in Figure 5.
使用 式 10 中描述的度量,优先级基线 和 优先级决斗版本之间的直接比较如图 5 所示。
The combination of prioritized replay and the dueling network results in vast improvements over the previous state-of-the-art in the popular ALE benchmark.
优先回放与决斗网络的结合,在流行的 ALE 基准上相对此前的最先进水平取得了巨大的改进。


Figure 5. Improvements of dueling architecture over Prioritized DDQN baseline, using the same metric as Figure 4.
图 5 与优先 DDQN 基线相比,决斗架构的改进,使用与图 4 相同的度量。
Again, the dueling architecture leads to significant improvements over the single-stream baseline on the majority of games.
再一次,决斗架构在大多数游戏的单流基线上带来了显著的改进。

显著性图

Saliency maps.
To better understand the roles of the value and the advantage streams, we compute saliency maps (Simonyan et al., 2013).
为了更好地理解价值流 和 优势流的作用,我们计算了显著性图(Simonyan et al., 2013)。
More specifically, to visualize the salient part of the image as seen by the value stream, we compute the absolute value of the Jacobian of $\widehat V$ with respect to the input frames: $|\nabla_s \widehat V(s;\theta)|$.
更具体地说,为了可视化价值流所关注的图像显著部分,我们计算 $\widehat V$ 关于输入帧的雅可比矩阵的绝对值:$|\nabla_s \widehat V(s;\theta)|$。
Similarly, to visualize the salient part of the image as seen by the advantage stream, we compute $\big|\nabla_s \widehat A\big(s, \arg\max_{a^\prime}\widehat A(s,a^\prime);\theta\big)\big|$.
类似地,为了可视化优势流所关注的图像显著部分,我们计算 $\big|\nabla_s \widehat A\big(s, \arg\max_{a^\prime}\widehat A(s,a^\prime);\theta\big)\big|$。
Both quantities are of the same dimensionality as the input frames and therefore can be visualized easily alongside the input frames.
这两个量与输入帧具有相同的维度,因此可以很容易地在输入帧旁边可视化。
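上述两个雅可比量都可以用自动求导直接得到。下面是一个示意(假设网络按前文拆成 conv 与带 value / adv 两条流的 head;这不是论文的原始实现):

```python
import torch

def saliency_maps(net, frames):
    """分别计算 |∂V/∂s| 与 |∂A(s, argmax_a A)/∂s|,输出尺寸与输入帧相同。"""
    s = frames.clone().requires_grad_(True)
    feat = net.conv(s)

    v = net.head.value(feat).sum()
    v.backward(retain_graph=True)                     # 价值流的显著性
    value_saliency = s.grad.detach().abs()

    s.grad = None
    a = net.head.adv(feat)
    a.gather(1, a.argmax(dim=1, keepdim=True)).sum().backward()   # 优势流的显著性
    adv_saliency = s.grad.detach().abs()
    return value_saliency, adv_saliency
```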

Here, we place the gray scale input frames in the green and blue channel and the saliency maps in the red channel.
在这里,我们将灰度输入帧 放在绿色和蓝色通道中,并将显著性图放在红色通道中。
All three channels together form an RGB image.
所有三个通道一起形成 RGB 图像。
Figure 2 depicts the value and advantage saliency maps on the Enduro game for two different time steps.
图 2 描述了 Enduro 游戏在两个不同时间步下的价值和优势显著性地图。
As observed in the introduction, the value stream pays attention to the horizon where the appearance of a car could affect future performance.
正如在引言中所观察到的,价值流关注地平线,因为在那里出现的汽车可能会影响未来的表现。
The value stream also pays attention to the score.
价值流也关注分数。
The advantage stream, on the other hand, cares more about cars that are on an immediate collision course.
另一方面,优势流更关心那些即将发生碰撞的汽车。


Figure 2.
See, attend and drive: Value and advantage saliency maps (red-tinted overlay) on the Atari game Enduro, for a trained dueling architecture.
观察、注意与驾驶:在 Atari 游戏 Enduro 上,训练好的决斗架构的价值与优势显著性图(红色叠加)。
The value stream learns to pay attention to the road.
价值流学会了关注道路。
The advantage stream learns to pay attention only when there are cars immediately in front, so as to avoid collisions.
优势流学会只在正前方有汽车时才加以注意,以避免碰撞。

5. 讨论

The advantage of the dueling architecture lies partly in its ability to learn the state-value function efficiently.
决斗结构的优势部分在于它能够有效地学习 状态价值函数。
With every update of the Q values in the dueling architecture, the value stream V is updated-this contrasts with the updates in a single-stream architecture where only the value for one of the actions is updated, the values for all other actions remain untouched.
随着决斗架构中 Q 值的每次更新,价值流 V 也会更新——这与单流体系结构中的更新形成对比,在单流体系结构中,只有一个动作的价值被更新,所有其他动作的价值保持不变
This more frequent updating of the value stream in our approach allocates more resources to V, and thus allows for better approximation of the state values, which in turn need to be accurate for temporal difference-based methods like Q-learning to work (Sutton & Barto, 1998).
在我们的方法中,价值流这种更频繁的更新为 V 分配了更多资源,从而能更好地逼近状态价值;而状态价值的准确性正是 Q-learning 等基于时序差分的方法正常工作所必需的(Sutton & Barto, 1998)。
This phenomenon is reflected in the experiments, where the advantage of the dueling architecture over single-stream Q networks grows when the number of actions is large.
这种现象反映在实验中,当动作数量很大时,决斗架构相对于单流 Q 网络的优势就会增加。

Furthermore, the differences between Q-values for a given state are often very small relative to the magnitude of Q.
此外,相对于 Q 的大小,给定状态下 Q 值之间的差异通常非常小。
For example, after training with DDQN on the game of Seaquest, the average action gap (the gap between the Q values of the best and the second best action in a given state) across visited states is roughly 0.04, whereas the average state value across those states is about 15.
例如,在 Seaquest 游戏中使用 DDQN 进行训练后,访问状态的平均动作差距(给定状态下最佳 和 第二最佳动作的 Q 值之间的差距)大约为 0.04,而这些状态的平均状态价值约为 15。
This difference in scales means that small amounts of noise in the updates can lead to reorderings of the actions, and thus make the nearly greedy policy switch abruptly.
这种尺度差异意味着更新中的少量噪声就可能导致动作的重新排序,从而使近似贪心的策略突然切换。
The dueling architecture with its separate advantage stream is robust to such effects.
具有独立优势流的决斗结构对这种影响具有较强的鲁棒性。

6. 结论

We introduced a new neural network architecture that decouples value and advantage in deep Q-networks, while sharing a common feature learning module.
我们介绍了一种新的神经网络架构,将深度 Q 网络中的价值 和 优势解耦,同时共享一个共同的特征学习模块。
The new dueling architecture, in combination with some algorithmic improvements, leads to dramatic improvements over existing approaches for deep RL in the challenging Atari domain.
新的决斗架构与一些算法改进相结合,在具有挑战性的 Atari 领域对现有的深度强化学习方法进行了巨大的改进。
The results presented in this paper are the new state-of-the-art in this popular domain.
本文展现的结果是这一流行领域新的最先进水平。

参考文献

附录

A. Double DQN Algorithm
(原文此处为 Double DQN 算法的伪代码图)
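下面给出一个非常简化的 DDQN 训练循环草图,把前文的 QNet、sample、ddqn_target 等示意函数串起来;这不是论文附录的原始伪代码,ε、同步频率等超参数以及旧版 gym 风格的环境接口均为演示用的假设(学习率 6.25e-5 与裁剪范数 10 取自第 4.2 节):

```python
import random
import torch
import torch.nn as nn

def train_ddqn(env, online, target, steps=100_000, eps=0.05,
               batch_size=32, sync_every=10_000, lr=6.25e-5):
    """极简 DDQN 训练循环草图(示意)。"""
    opt = torch.optim.RMSprop(online.parameters(), lr=lr)
    target.load_state_dict(online.state_dict())
    buffer = []
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    for t in range(steps):
        # ε-greedy 行为策略(off-policy:与正在学习的贪心策略不同)
        if random.random() < eps:
            a = env.action_space.sample()
        else:
            a = int(online(s.unsqueeze(0)).argmax())
        s2, r, done, _ = env.step(a)
        s2 = torch.as_tensor(s2, dtype=torch.float32)
        buffer.append((s, a, float(r), s2, float(done)))
        s = torch.as_tensor(env.reset(), dtype=torch.float32) if done else s2

        if len(buffer) >= batch_size:
            batch = sample(buffer, batch_size)            # 均匀经验回放
            y = ddqn_target(online, target, batch)        # 式 (6)
            q = online(batch[0]).gather(1, batch[1].unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q, y)
            opt.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(online.parameters(), 10.0)
            opt.step()

        if t % sync_every == 0:
            target.load_state_dict(online.state_dict())   # 定期冻结/同步目标网络 θ⁻
    return online
```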

Table 2. Raw scores across all games. Starting with 30 no-op actions.
表 2 所有游戏的原始分数。从 30 个无操作 动作开始。


Table 3. Raw scores across all games. Starting with Human starts.
表 3 所有游戏的原始分数。以人类启动开始。


Table 4. Normalized scores across all games. Starting with 30 no-op actions.
表 4 所有游戏的标准化分数。从 30 个无操作 动作开始。


Table 5. Normalized scores across all games. Starting with Human Starts.
表 5 所有游戏的标准化分数。以人类启动开始。

