Counterfactual Policy Gradients Explained

Among its many challenges, multi-agent reinforcement learning has one obstacle that is often overlooked: credit assignment. To explain this concept, let’s first take a look at an example.

Say we have two robots, robot A and robot B, trying to collaboratively push a box into a hole. Both receive a reward of 1 if they push it in and 0 otherwise. In the ideal case, the two robots would push the box toward the hole at the same time, maximizing the speed and efficiency of the task.

However, suppose that robot A does all the heavy lifting, meaning robot A pushes the box into the hole while robot B stands idly on the sidelines. Even though robot B simply loitered around, both robot A and robot B would receive a reward of 1. In other words, the same behavior is encouraged later on even though robot B executed a suboptimal policy. This is where the issue of credit assignment comes in. In multi-agent systems, we need a way to give credit, or reward, to agents who contribute to the overall goal, not to those who let others do the work.

Okay, so what’s the solution? Maybe we only give rewards to agents who contribute to the task itself.

Photo by Kira auf der Heide on Unsplash

It’s Harder than It Seems

It seems like this simple solution might just work, but we have to keep several things in mind.

First, state representations in reinforcement learning might not be expressive enough to tailor rewards this precisely. In other words, we can’t always easily quantify whether an agent contributed to a given task and dole out rewards accordingly.

Second, we don’t want to handcraft these rewards, because doing so defeats the purpose of designing multi-agent algorithms. There’s a fine line between telling agents how to collaborate and encouraging them to learn how to do so.

One Answer

Counterfactual policy gradients address this issue of credit assignment without explicitly handing the answer to the agents.

The main idea behind the approach? Train each agent’s policy by comparing its chosen action against the other actions it could have taken. In other words, an agent will ask itself:

“Would we have gotten more reward if I had chosen a different action?”

By putting this thinking process into mathematics, counterfactual multi-agent (COMA) policy gradients tackle the issue of credit assignment by quantifying how much each agent contributes to completing the task.

Photo by Brandon Mowinkel on Unsplash

The Components

COMA is an actor-critic method that uses centralized training with decentralized execution. This means we train two networks:

  • An actor: given a state, outputs an action

  • A critic: given a state, estimates a value function

In addition, the critic is only used during training and is removed at test time. We can think of the critic as the algorithm’s “training wheels”: we use it to guide the actor throughout training and to advise it on how to update and learn its policies. However, we remove the critic when it’s time to execute the actor’s learned policies.
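To make the two components concrete, here is a minimal PyTorch sketch of how they could be parameterized. This is an illustration under simplifying assumptions (plain feed-forward networks and one-hot discrete actions); the paper’s actual architecture uses recurrent actors over observation histories.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Decentralized actor: maps an agent's local observation to action probabilities."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(obs), dim=-1)

class CentralCritic(nn.Module):
    """Centralized critic: given the global state and the other agents' (one-hot) actions,
    outputs one Q-value for every candidate action of the agent being updated."""
    def __init__(self, state_dim: int, n_agents: int, n_actions: int, hidden: int = 128):
        super().__init__()
        in_dim = state_dim + (n_agents - 1) * n_actions
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, other_actions_onehot: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, other_actions_onehot], dim=-1))
```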

For more background on actor-critic methods in general, take a look at Chris Yoon’s in-depth article here:

Let’s start by taking a look at the critic. In this algorithm, we train a network to estimate the joint Q-value across all agents. We’ll discuss the critic’s nuances and how it’s specifically designed later in this article. For now, all we need to know is that we keep two copies of the critic network: one is the network we are trying to train, and the other is our target network, used for training stability. The target network’s parameters are copied from the training network periodically.
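The periodic copy is the standard target-network trick. A small sketch of what it might look like, reusing the CentralCritic sketch above (the sizes and copy interval are arbitrary choices for illustration, not values from the paper):

```python
import copy

critic = CentralCritic(state_dim=32, n_agents=2, n_actions=5)  # hypothetical sizes
target_critic = copy.deepcopy(critic)

def maybe_update_target(step: int, interval: int = 200) -> None:
    # Hard-copy the training critic's parameters into the target critic every `interval` steps.
    if step % interval == 0:
        target_critic.load_state_dict(critic.state_dict())
```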

To train the networks, we use on-policy training. Instead of using a one-step or n-step lookahead to determine our target Q-values, we use TD(λ), which blends together n-step returns.

[Figure: n-step returns and the target value using TD(λ)]

where γ is the discount factor, r denotes the reward at a specific time step, f is our target value function, and λ is a hyperparameter. This seemingly infinite-horizon value is computed using bootstrapped estimates from the target network.
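The equations in the missing figure are not reproduced in this copy; based on the definitions above, the standard TD(λ) construction they refer to looks roughly like this (my reconstruction, not a verbatim excerpt from the paper):

```latex
G_t^{(n)} = \sum_{l=1}^{n} \gamma^{\,l-1}\, r_{t+l} \;+\; \gamma^{n} f(\cdot_{t+n}),
\qquad
y_t^{(\lambda)} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}
```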

For more information on TD(λ), Andre Violante’s article provides a fantastic explanation:
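As a hypothetical illustration of how such targets can be computed in practice, here is a small NumPy sketch that builds λ-returns for one episode with the usual backward recursion, bootstrapping from target-network values; the function and variable names are my own.

```python
import numpy as np

def lambda_returns(rewards, target_values, gamma=0.99, lam=0.8):
    """TD(lambda) targets for one episode.

    rewards[t]       -- reward received after step t
    target_values[t] -- target network's value estimate for the state following step t
    Uses the backward recursion  y_t = r + gamma * ((1 - lam) * v_next + lam * y_{t+1}).
    """
    T = len(rewards)
    y = np.zeros(T)
    next_return = target_values[-1]  # bootstrap the tail of the episode
    for t in reversed(range(T)):
        y[t] = rewards[t] + gamma * ((1 - lam) * target_values[t] + lam * next_return)
        next_return = y[t]
    return y
```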

Finally, we update the critic’s parameters by minimizing this loss function:

[Figure: critic loss function]
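The loss itself is not reproduced in this copy; given the TD(λ) target above, it is the squared TD error of the critic, roughly:

```latex
\mathcal{L}_t(\theta^{c}) \;=\; \left( y_t^{(\lambda)} - Q_{\theta^{c}}(s_t, \mathbf{u}_t) \right)^{2}
```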
Photo by Jose Morales on Unsplash

The Catch

Now, you may be wondering: this is nothing new! What makes this algorithm special? The beauty of this algorithm lies in how we update the actor networks’ parameters.

In COMA, we train a stochastic policy, meaning each action in a given state is chosen with a specific probability that changes throughout training. In typical actor-critic settings, we update the policy using a policy gradient, usually with the value function as a baseline, giving the advantage actor-critic update:

[Figure: naive advantage actor-critic policy update]
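That update is not shown in this copy; in its usual single-baseline form, the gradient for an actor is roughly (my reconstruction), with the joint Q-value offset by a state-value baseline:

```latex
g \;=\; \nabla_{\theta}\, \log \pi_{\theta}(u \mid s)\, \bigl( Q(s, \mathbf{u}) - V(s) \bigr)
```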

However, there’s a problem here. This fails to address the original issue we were trying to solve: credit assignment. We have no notion of how much any one agent contributes to the task. Instead, all agents are given the same amount of “credit,” since our value function estimates a joint value function. As a result, COMA proposes using a different term as the baseline.

To calculate this counterfactual baseline for each agent, we compute an expected value over all the actions that agent could take while keeping the actions of all other agents fixed.

[Figure: adding the counterfactual baseline to the advantage function estimate]
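The equation in the missing figure is the counterfactual advantage for an agent a; as defined in the COMA paper [1], it takes the form

```latex
A^{a}(s, \mathbf{u}) \;=\; Q(s, \mathbf{u}) \;-\; \sum_{u'^{a}} \pi^{a}\!\left(u'^{a} \mid \tau^{a}\right)\, Q\!\left(s, \bigl(\mathbf{u}^{-a}, u'^{a}\bigr)\right)
```

where \mathbf{u}^{-a} denotes the joint action of all agents other than a, and \tau^{a} is agent a’s observation history.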

Let’s take a step back and dissect this equation. The first term is just the Q-value associated with the global state and the joint action of all agents. The second term is an expected value: each term in the summation multiplies two values together. The first is the probability that this agent would have chosen a particular action under its current policy. The second is the Q-value of taking that action while all other agents keep their actions fixed.

Now, why does this work? Intuitively, by using this baseline, the agent learns how much reward its chosen action contributes relative to all the other actions it could have taken. In doing so, it can better distinguish which of its actions contribute more to the overall reward shared across all agents.
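For a discrete action space, this baseline is just a dot product between the agent’s action probabilities and the critic’s Q-values for its candidate actions. A small NumPy sketch of that computation for one agent (names and numbers are illustrative only):

```python
import numpy as np

def counterfactual_advantage(q_values, policy_probs, chosen_action):
    """COMA-style advantage for one agent with discrete actions.

    q_values[i]     -- Q(s, (u^{-a}, u'^a = i)): joint Q with the other agents' actions fixed
    policy_probs[i] -- pi^a(u'^a = i | tau^a): the agent's current action probabilities
    chosen_action   -- index of the action the agent actually took
    """
    baseline = np.dot(policy_probs, q_values)   # expected Q under the agent's own policy
    return q_values[chosen_action] - baseline   # A^a(s, u)

# Hypothetical example: 3 candidate actions, the agent took action 1.
advantage = counterfactual_advantage(
    q_values=np.array([0.2, 1.0, 0.4]),
    policy_probs=np.array([0.2, 0.5, 0.3]),
    chosen_action=1,
)
```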

COMA proposes a specific network architecture that makes computing this baseline more efficient [1]. Furthermore, the algorithm can be extended to continuous action spaces by estimating the expected value with Monte Carlo samples.
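In the continuous case, the sum over actions can no longer be enumerated, so the baseline would be estimated by sampling from the agent’s own policy. A rough sketch of that idea (my own illustration, not code from the paper):

```python
import numpy as np

def monte_carlo_baseline(sample_action, q_of_action, n_samples=32):
    """Estimate E_{u'^a ~ pi^a}[ Q(s, (u^{-a}, u'^a)) ] for a continuous action space.

    sample_action -- callable drawing one action from the agent's current policy
    q_of_action   -- callable returning the joint Q for that action, with the other
                     agents' actions held fixed
    """
    return float(np.mean([q_of_action(sample_action()) for _ in range(n_samples)]))
```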

Photo by JESHOOTS.COM on Unsplash

Results

COMA was tested on StarCraft unit micromanagement, pitted against various centralized and independent actor-critic variants estimating either Q-values or value functions. The approach was shown to significantly outperform the others. For the officially reported results and analysis, check out the original paper [1].

Conclusion

Nobody likes slackers. Neither do robots.

Allowing agents to properly recognize their individual contributions to a task, and to optimize their policies accordingly, is an essential part of making robots collaborate. In the future, better decentralized approaches may be explored, effectively shrinking the joint learning space. This is easier said than done, as with all problems of this sort, but COMA is a strong milestone toward letting multi-agent systems function at a far higher, more complex level.

From the classic to the state of the art, here are related articles discussing both multi-agent and single-agent reinforcement learning:

Source: https://towardsdatascience.com/counterfactual-policy-gradients-explained-40ac91cef6ae

