Filtrations in Reinforcement Learning: What They Are and Why We Don’t Need Them

Once in a while, when reading papers in the Reinforcement Learning domain, you may stumble across mysterious-sounding phrases such as ‘we deal with a filtered probability space’, ‘the expected value is conditional on a filtration’ or ‘the decision-making policy is ℱ-measurable’. Without formal training in measure theory [2,3], it may be difficult to grasp what exactly such a filtration entails. Formal definitions look something like this:

[Figure: Formal definition of a filtration ([2], own work)]
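
The figure itself is not reproduced here, so below is a standard textbook formulation of the definition. It is paraphrased in the spirit of [2], not quoted. The continuous index t ∈ [0, T] is an assumption of this sketch; in the discrete example later on, t simply runs over 0, 1, 2, 3.

```latex
\begin{aligned}
&\text{Let } \Omega \neq \emptyset \text{ and fix } T > 0. \text{ For every } t \in [0, T], \text{ let } \mathcal{F}_t \text{ be a } \sigma\text{-algebra of subsets of } \Omega.\\
&\text{If } \mathcal{F}_s \subseteq \mathcal{F}_t \text{ whenever } 0 \le s \le t \le T, \text{ then the collection } \{\mathcal{F}_t\}_{0 \le t \le T} \text{ is called a filtration.}
\end{aligned}
```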

Boilerplate language for those familiar with measure theory, no doubt, but hardly helpful otherwise. Googling for answers likely leads through a maze of σ-algebras, Borel sets, Lebesgue measures and Hausdorff spaces, again presuming that one already knows the basics. Fortunately, only a very basic understanding of a filtration is needed to grasp its implications within the RL domain. This article will provide far from a full discussion on the topic, but aims to give a brief and (hopefully) intuitive outline of the core concept.

An example

In RL, we typically define an outcome space Ω that contains all possible outcomes or samples that may occur, with ω being a specific sample path. For the sake of illustration, we will assume that our RL problem relates to a stock with price Sₜ on day t. Of course we’d like to buy low and sell high (the precise decision-making problem is irrelevant here). We might denote the buying/selling decision as xₜ(ω), i.e., the decision is conditional on the price path. We start with price S₀ (a real number), and every day the price goes up or down according to some random process. We may simulate (or mathematically define) such a price path ω=[ω₁,…,ωₜ] up front, before running the episode. However, that does not mean we should know stock price movements before they actually happen; even Warren Buffett could only dream of having such information! To base our decisions on ω without being clairvoyant, we may state that the outcome space is ‘filtered’ (using the symbol ℱ). This means we can only observe the sample up to time t.
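
Below is a minimal Python sketch of this idea. It assumes up/down moves and a hypothetical toy decision rule `decide`, which is not the author's policy. The whole path ω is drawn up front, but the decision at time t only ever sees the first t moves.

```python
import random

def simulate_path(horizon, seed=0):
    """Draw the whole sample path omega up front as a list of 'u'/'d' moves."""
    rng = random.Random(seed)
    return [rng.choice("ud") for _ in range(horizon)]

def revealed(omega, t):
    """The part of omega observable at time t: only the first t moves."""
    return tuple(omega[:t])

def decide(history):
    """Hypothetical toy rule: buy after a down-move, sell after an up-move."""
    if not history:
        return "hold"
    return "buy" if history[-1] == "d" else "sell"

omega = simulate_path(horizon=3, seed=42)
# x_t(omega) may only look at revealed(omega, t), never at omega[t:].
decisions = [decide(revealed(omega, t)) for t in range(len(omega))]
print(omega, decisions)
```

`decide` only ever receives `revealed(omega, t)`. Flipping any move from time t onward therefore cannot change the decision at t, which is the guarantee a filtration formalizes.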

For most RL practitioners, this restriction must sound familiar. Don’t we usually base our decisions on the current state Sₜ? Indeed, we do. In fact, the Markov property implies that the stochastic process is memoryless: we only need the information embedded in the prevailing state Sₜ, and information from the past is irrelevant [5]. As we will shortly see, the filtration is richer and more generic than a state, yet for practical purposes their implications are similar.

Let’s formalize our stock price problem a bit more. We start with a discrete problem setting, in which the price either goes up (u) or down (-d). Considering an episode horizon of three days, the outcome space Ω may be visualized by a binomial lattice [4]:

[Figure: binomial lattice of the outcome space Ω for a three-day horizon (image by author)]
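
The lattice can be enumerated in a few lines of Python. This sketch assumes additive moves (an up-move adds U, a down-move subtracts D) with hypothetical values S0=100 and U=D=1. The original figure may use different numbers.

```python
from itertools import product

S0 = 100.0       # hypothetical starting price
U, D = 1.0, 1.0  # hypothetical additive moves: up adds U, down subtracts D

# Outcome space: all 2**3 = 8 three-day paths such as 'udu'.
omega_space = ["".join(p) for p in product("ud", repeat=3)]

def price(path, t):
    """Price S_t after the first t moves of a path."""
    return S0 + sum(U if move == "u" else -D for move in path[:t])

# Distinct prices per day; 'ud' and 'du' end at the same node (recombining lattice).
lattice = {t: sorted({price(w, t) for w in omega_space}) for t in range(4)}

for t, prices in lattice.items():
    print(f"day {t}: {prices}")
```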

Definition of events and filtrations

At this point, we need to define the notion of an ‘event’ A ⊆ Ω. Stated somewhat abstractly, an event is a subset of the outcome space, i.e., a set of possible sample paths. Simply put, we can assign a probability to an event and assert whether it has happened or not: it has happened if the realized path ω lies in A. As we will soon show, an event is therefore not the same as a realization ω.
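
A short sketch of the distinction follows. An event is a set of paths that carries a probability. A realization is a single path that either lies in that set or does not. The probability model (independent daily moves with P(up)=1/2) is an assumption made for illustration only.

```python
from fractions import Fraction
from itertools import product

# Outcome space: the eight three-day paths.
omega_space = frozenset("".join(p) for p in product("ud", repeat=3))

def prob(event, p_up=Fraction(1, 2)):
    """P(event), assuming independent daily moves with P(up) = p_up."""
    total = Fraction(0)
    for path in event:
        ups = path.count("u")
        total += p_up**ups * (1 - p_up) ** (len(path) - ups)
    return total

# Event: "the price went up on day 1", i.e. the set of paths starting with 'u'.
A_u = frozenset(w for w in omega_space if w.startswith("u"))

omega = "udu"            # one realization: a single sample path
happened = omega in A_u  # the event occurred iff the realized path lies in it
print(prob(A_u), happened)
```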

A filtration is a mathematical model that represents partial knowledge about the outcome. In essence, it tells us whether an event has happened or not. The ‘filtration process’ may be visualized as a sequence of filters, each providing us a more detailed view. Concretely, in an RL context the filtration provides us with the information needed to compute the current state Sₜ, without giving any indication of future changes in the process [2], much like the Markov property.

Formally, a filtration is an increasing sequence of σ-algebras. You don’t need to know the ins and outs, but some background is useful. Loosely defined, a σ-algebra is a collection of subsets of the outcome space (events) that contains Ω itself and is closed under complements and countable unions. In measure theory this concept has major implications; for the purpose of this article, you only need to remember that a σ-algebra is a collection of events.
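
On a finite outcome space, the σ-algebra requirements can be checked mechanically. The helper below is a hypothetical illustration. With only finitely many sets, closure under countable unions reduces to closure under pairwise unions.

```python
from itertools import combinations

def is_sigma_algebra(collection, omega_space):
    """Check the sigma-algebra axioms for subsets of a FINITE outcome space."""
    omega_space = frozenset(omega_space)
    sets = {frozenset(s) for s in collection}
    if omega_space not in sets:
        return False
    # Closed under complements.
    if any((omega_space - s) not in sets for s in sets):
        return False
    # Closed under unions: with finitely many sets, pairwise unions suffice.
    return all((a | b) in sets for a, b in combinations(sets, 2))

omega_space = {"uuu", "uud", "udu", "udd", "duu", "dud", "ddu", "ddd"}
A_u = {w for w in omega_space if w.startswith("u")}
A_d = omega_space - A_u

print(is_sigma_algebra([set(), A_u, A_d, omega_space], omega_space))       # a valid sigma-algebra
print(is_sigma_algebra([set(), {"uuu", "uud"}, omega_space], omega_space))  # complement missing
```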

Example revisited — discrete case

Back to the example, because the filtration only comes alive when we see it in action. We first need to define the events, using sequences such as ‘udu’ to describe price movements over time. At t=0 we basically don’t know anything; all paths are still possible. Thus, the event set A={uuu, uud, udu, udd, ddd, ddu, dud, duu} contains all possible paths ω ∈ Ω. At t=1, we know a little more: the stock price went either up or down. The corresponding events are defined by Aᵤ={uuu,uud,udu,udd} and Aₔ={ddd,ddu,dud,duu}. If the stock price went up, we can surmise that our sample path ω will be in Aᵤ and not in Aₔ (and vice versa, of course). At t=2, we have four event sets: Aᵤᵤ={uuu,uud}, Aᵤₔ={udu,udd}, Aₔᵤ={duu,dud}, and Aₔₔ={ddu,ddd}. Observe that the information is getting increasingly fine-grained; the sets to which ω might belong are becoming smaller and more numerous. At t=3, we obviously know the exact price path that has been followed.
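
The event sets above are exactly the groups of paths that share the same first t moves. For example, the prefix ‘ud’ yields Aᵤₔ. Here is a minimal sketch that recovers them; the helper name `event_sets` is hypothetical.

```python
from itertools import product

omega_space = ["".join(p) for p in product("ud", repeat=3)]

def event_sets(t):
    """Group paths by their first t moves; paths in one group are indistinguishable at time t."""
    groups = {}
    for path in omega_space:
        groups.setdefault(path[:t], set()).add(path)
    return groups

for t in range(4):
    print(t, {prefix or "-": sorted(paths) for prefix, paths in event_sets(t).items()})
```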

Having defined the events, we can define the corresponding filtrations for t=0,1,2,3:

[Figure: the filtrations ℱ₀, ℱ₁, ℱ₂ and ℱ₃ of the discrete example (image by author)]

At t=0, every outcome is possible. We initialize the filtration with the empty set ∅ and the outcome space Ω, i.e., ℱ₀={∅, Ω}, also known as the trivial σ-algebra.

At t=1, we can simply add Aᵤ and Aₔ to ℱ₀ to obtain ℱ₁={∅, Aᵤ, Aₔ, Ω}. Recall from the definition that each σ-algebra in the filtration includes all elements of its predecessor. We can use the freshly revealed information to compute S₁. We also get a peek into the future (without actually revealing future information!): if the price went up, we can no longer reach the lowest possible price at t=3. The event sets are illustrated below.

[Figure: Visualization of event sets Aᵤ and Aₔ at t=1 (Source: [2], image by author)]
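
On a finite outcome space, ‘ℱₜ-measurable’ boils down to a simple test. A quantity can be computed from the information in ℱₜ exactly when it takes a single value on each event set known at time t. The sketch below checks that S₁ passes this test at t=1 while S₂ does not. It also confirms the peek into the future. Prices use hypothetical additive moves (S0=100, U=D=1), and the helper names are illustrative.

```python
from itertools import product

S0, U, D = 100.0, 1.0, 1.0  # hypothetical start price and additive up/down moves
omega_space = ["".join(p) for p in product("ud", repeat=3)]

def price(path, t):
    """Price S_t after the first t moves of a path."""
    return S0 + sum(U if move == "u" else -D for move in path[:t])

def event_sets(t):
    """Paths grouped by their first t moves (the event sets known at time t)."""
    groups = {}
    for path in omega_space:
        groups.setdefault(path[:t], set()).add(path)
    return groups

def computable_at(t_info, t_price):
    """True if S_{t_price} takes one value on every event set known at time t_info."""
    return all(
        len({price(w, t_price) for w in group}) == 1
        for group in event_sets(t_info).values()
    )

s1_known_at_1 = computable_at(1, 1)  # S_1 from F_1
s2_known_at_1 = computable_at(1, 2)  # S_2 would require an unrevealed move

# Peek into the future: after an up-move, the lowest day-3 price is unreachable.
lowest_overall = min(price(w, 3) for w in omega_space)
lowest_after_up = min(price(w, 3) for w in event_sets(1)["u"])
print(s1_known_at_1, s2_known_at_1, lowest_overall, lowest_after_up)
```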

At t=2, we may distinguish between four events depending on the price paths revealed so far. Here things get a bit more involved, as we also need to add the unions and complements (in line with the requirements of the σ-algebra). In total, ℱ₂ contains 2⁴=16 sets: every union of the four event sets, including ∅ and Ω. This was not necessary for ℱ₁, as the union of Aᵤ and Aₔ equals the outcome space and Aᵤ is the complement of Aₔ. From an RL perspective, you might note that we have more information than strictly needed. For instance, an up-movement followed by a down-movement yields the same price as the reverse. In RL applications we would typically not store such redundant information, yet you can probably recognize the mathematical appeal.

[Figure: Visualization of event sets Aᵤᵤ, Aᵤₔ, Aₔᵤ and Aₔₔ at t=2 (Source: [2], image by author)]
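
The redundancy becomes explicit when you compare two things: what ℱ₂ distinguishes (four two-day histories) and what the state S₂ distinguishes (the resulting prices). This is a sketch with hypothetical additive moves (S0=100, U=D=1):

```python
from itertools import product

S0, U, D = 100.0, 1.0, 1.0  # hypothetical start price and additive up/down moves
omega_space = ["".join(p) for p in product("ud", repeat=3)]

def price(path, t):
    """Price S_t after the first t moves of a path."""
    return S0 + sum(U if move == "u" else -D for move in path[:t])

histories_t2 = {w[:2] for w in omega_space}     # what F_2 tells apart: uu, ud, du, dd
states_t2 = {price(w, 2) for w in omega_space}  # what the state S_2 tells apart
merged = price("ud", 2) == price("du", 2)       # A_ud and A_du share one price

print(len(histories_t2), len(states_t2), merged)
```

A Markov decision rule xₜ(Sₜ) only needs the three values in `states_t2`, whereas the filtration keeps all four histories apart.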

At t=3, using the same procedure as before, we already have 2⁸=256 sets (every union of the eight individual paths). You can see that filtrations quickly become extremely large. A filtration always contains all elements of the preceding step: our filtration gets richer and more fine-grained with the passing of time. All this means is that we can more precisely pinpoint the events to which our sample price path may or may not belong.
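
The sizes of ℱ₀ through ℱ₃ (2, 4, 16 and 256 sets) can be reproduced by building each ℱₜ as all unions of the event sets at time t. On a finite space, that collection automatically contains the complements and unions the σ-algebra requires. Below is a minimal sketch; the helper names are hypothetical.

```python
from itertools import chain, combinations, product

omega_space = ["".join(p) for p in product("ud", repeat=3)]

def event_sets(t):
    """Paths grouped by their first t moves (the smallest events known at time t)."""
    groups = {}
    for path in omega_space:
        groups.setdefault(path[:t], set()).add(path)
    return [frozenset(g) for g in groups.values()]

def sigma_algebra(t):
    """F_t as every union of the event sets at time t (the empty union gives the empty set)."""
    blocks = event_sets(t)
    combos = chain.from_iterable(combinations(blocks, r) for r in range(len(blocks) + 1))
    return {frozenset().union(*combo) for combo in combos}

F = [sigma_algebra(t) for t in range(4)]
sizes = [len(f) for f in F]                       # 2 ** (number of event sets)
nested = all(F[t] <= F[t + 1] for t in range(3))  # each F_t contains its predecessor
print(sizes, nested)
```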

A continuous example

We are almost there, but we would be remiss if we only treated discrete problems. In reality, stock prices do not only go ‘up’ or ‘down’; they change within a continuous domain. The same holds for many other RL problems. Although conceptually the same as in the discrete case, filtrations in continuous settings are difficult to describe explicitly. Again, some illustrations might help more than formal definitions.

Suppose that at every time step, we simulate a return from the real domain [-d,u]. Depending on how far ahead we look, we may then define an interval in which the future stock price will lie, say [329,335] at a given point in time. We can then define intervals within this domain. Any arbitrary interval may constitute an event, for instance:

[Figure: an example of an interval event]

The complement of an interval could look like

[Figure: an example of the complement of an interval]

Furthermore, a plethora of unions may be defined, such as

[Figure: an example of a union of intervals]

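As you may have guessed, there are infinitely many such events in all shapes and sizes; in fact, the σ-algebra generated by such intervals (the Borel σ-algebra) is uncountable. The definition does not require the events themselves to be countable. It only requires closure under complements and countable unions, and we can still assign a probability to every event in the σ-algebra [2,3].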
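
Below is a minimal sketch of such interval events. Purely for illustration, it assumes the future price is uniformly distributed on [329,335]. Events are represented as lists of (a, b) intervals, and all helper names are hypothetical. Endpoints are ignored, since a single point has probability zero under a continuous distribution.

```python
LOW, HIGH = 329.0, 335.0  # price range from the example

def normalize(event):
    """Clip intervals to [LOW, HIGH], drop empty ones, sort and merge overlaps."""
    clipped = sorted(
        (max(a, LOW), min(b, HIGH)) for a, b in event if min(b, HIGH) > max(a, LOW)
    )
    merged = []
    for a, b in clipped:
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged

def union(e1, e2):
    return normalize(e1 + e2)

def complement(event):
    """The parts of [LOW, HIGH] not covered by the event."""
    gaps, cursor = [], LOW
    for a, b in normalize(event):
        if a > cursor:
            gaps.append((cursor, a))
        cursor = max(cursor, b)
    if cursor < HIGH:
        gaps.append((cursor, HIGH))
    return gaps

def prob(event):
    """P(event) under the assumed uniform distribution on [LOW, HIGH]."""
    return sum(b - a for a, b in normalize(event)) / (HIGH - LOW)

A = [(330.0, 332.0)]
B = [(331.0, 333.5)]
print(complement(A), union(A, B), prob(A), prob(union(A, B)))
```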

The further we look into the future, the more we can deviate from our current stock price. We might visualize this with a cone shape that expands over time (displayed below for t=50 and t=80). Within the cone, we can define infinitely many intervals. As before, we acquire a more detailed view once more time has passed.

[Figure: Event sets in the continuous domain for t=50 and t=80. Within the cones, infinitely many intervals can be defined to construct the filtrations. (Source: [2], image by author)]
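
The widening cone follows directly from the bounded returns. If every step adds a return from [−d, u], the price after t steps lies between S₀ − t·d and S₀ + t·u. The sketch below assumes additive returns drawn uniformly from that range, with hypothetical values S₀=332 and d=u=0.5.

```python
import random

S0 = 332.0       # hypothetical current price
D, U = 0.5, 0.5  # hypothetical bounds: each step adds a return in [-D, U]

def cone(t):
    """Lowest and highest price reachable after t steps."""
    return S0 - t * D, S0 + t * U

def simulate(t, seed=0):
    """One price path of length t with returns drawn uniformly from [-D, U]."""
    rng = random.Random(seed)
    price = S0
    for _ in range(t):
        price += rng.uniform(-D, U)
    return price

for t in (50, 80):
    low, high = cone(t)
    print(f"t={t}: reachable prices in [{low}, {high}], one sample ends at {simulate(t):.2f}")
```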

Wrapping things up

When encountering filtrations in any RL paper, the basics treated in this article should suffice. Essentially, the only purpose of introducing filtrations ℱₜ is to ensure that decisions xₜ(ω) do not utilize information that has not yet been revealed. When the Markov property holds, a decision xₜ(Sₜ) that operates on the current state serves the same purpose. The filtration provides a rich description of the past, yet we do not need this information in memoryless problems. Nevertheless, from a mathematical perspective it is an elegant solution with many interesting applications. The reinforcement learning community consists of many researchers and engineers from different backgrounds working in a variety of domains, and not everyone speaks the same language. Sometimes learning another language goes a long way, even if only a few words.

[This article is partially based on my arXiv article ‘A Gentle Lecture Note on Filtrations in Reinforcement Learning’ [1].]

[1] Van Heeswijk, W. J. A. (2020). A Gentle Lecture Note on Filtrations in Reinforcement Learning. arXiv preprint arXiv:2008.02622

[2] Shreve, S. E. (2004). Stochastic Calculus for Finance II: Continuous-Time Models, Volume 11. Springer Science & Business Media.

[3] Shiryaev, A. N. (1996). Probability. Springer New York-Heidelberg.

[4] Luenberger, D. G. (1997). Investment Science. Oxford University Press.

[5] Powell, W. B. (2020). On State Variables, Bandit Problems and POMDPs. arXiv preprint arXiv:2002.06238

Source: https://towardsdatascience.com/filtrations-in-reinforcement-learning-what-they-are-and-why-we-dont-need-them-463c93a170d4
