
Title: The Gymnasium Cart Pole Environment and the REINFORCE Algorithm (Introduction to Reinforcement Learning, Part 2)


Table of Contents

  • I. The Gymnasium Cart Pole Environment
  • II. The REINFORCE Algorithm
    • 1. Principle
    • 2. REINFORCE Algorithm Implementation


I. The Gymnasium Cart Pole Environment

The Gymnasium Cart Pole environment is a dynamics simulation of an inverted pendulum mounted on a cart.

State space (observation):

0: Cart Position

1: Cart Velocity

2: Pole Angle

3: Pole Angular Velocity

Action space:

0: Push cart to the left

1: Push cart to the right

Immediate reward:

To encourage keeping the pole upright for as long as possible, a reward of +1 is received at every time step.

Episode termination criteria:

Termination: Pole Angle is greater than ±12°

Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)

Truncation: Episode length is greater than 500 (for CartPole-v1; the training loop below additionally caps each episode at 200 steps)
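
As a quick orientation, here is a minimal sketch (assuming a standard Gymnasium installation; not part of the original post) that creates the environment and inspects the spaces listed above:

import gymnasium as gym

env = gym.make('CartPole-v1')
print(env.observation_space)  # Box of 4 floats: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push left, 1 = push right

state, info = env.reset(seed=0)
# one random step; Gymnasium's step() returns (obs, reward, terminated, truncated, info)
state, reward, terminated, truncated, info = env.step(env.action_space.sample())
print(reward, terminated, truncated)  # +1 reward per step until termination or truncation
env.close()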


II. The REINFORCE Algorithm

1. Principle

For the principle of the REINFORCE algorithm and its Python implementation, we refer to Foundations of Deep Reinforcement Learning: Theory and Practice in Python.
Note that here we adopt the "Improving REINFORCE" variant, which subtracts a baseline from the returns:

$$\nabla_{\theta} J(\pi_\theta) \approx \sum_{t=0}^{T} \left(R_t(\tau)-b\right) \nabla_{\theta}\log\pi_\theta(a_t|s_t)$$

where $b$ is the mean return over the whole trajectory, used as a constant baseline for that trajectory:

$$b=\frac{1}{T} \sum_{t=0}^{T} R_t(\tau)$$
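
For completeness, subtracting a constant baseline leaves the gradient estimate unbiased, by the standard score-function identity (a one-line justification added here, not taken from the book excerpt):

$$\mathbb{E}_{a_t\sim\pi_\theta}\big[\, b\,\nabla_{\theta}\log\pi_\theta(a_t \mid s_t)\big] = b\,\nabla_{\theta}\sum_{a}\pi_\theta(a \mid s_t) = b\,\nabla_{\theta}\,1 = 0,$$

so the baseline only reduces the variance of the estimator.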
In addition, we stop REINFORCE training once the pole has been balanced successfully for 15 consecutive episodes, and then save the policy network.

At test time, the saved policy network is loaded and run for a longer horizon, and it still balances the pole well.


2. REINFORCE Algorithm Implementation

The policy network for the REINFORCE algorithm:

class Pi(nn.Module):
    # A policy network to be optimized by reinforcement learning
    def __init__(self, in_dim, out_dim):  # in_dim = 4, out_dim = 2
        # super(Pi, self).__init__()
        super().__init__()
        # the policy network: 4 -> 64 -> 2
        layers = [
            nn.Linear(in_dim, 64),   # 4 -> 64
            nn.ReLU(),               # activation function
            nn.Linear(64, out_dim),  # 64 -> 2
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()  # initialize memory
        self.train()           # set the model to training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):  # x -> state
        pdparam = self.model(x)  # forward pass
        # pdparam -> probability distribution parameters,
        # such as the logits of a categorical distribution
        return pdparam

    def act(self, state):
        # Sample an action from the policy and store its log-probability.
        # Convert the state from a NumPy array to a PyTorch tensor
        x = torch.from_numpy(state.astype(np.float32))
        # print("state: {}".format(state))
        pdparam = self.forward(x)  # forward pass to obtain the distribution parameters
        # print("pdparam: {}".format(pdparam))
        pd = torch.distributions.Categorical(logits=pdparam)  # probability distribution pi(.|s)
        # print("pd.probs: {}\t pd.logits: {}".format(pd.probs, pd.logits))
        action = pd.sample()  # sample an action a ~ pi(a|s)
        # log probability of the sampled action under pd: $\log(\pi_{\theta}(a_t|s_t))$,
        # where $\pi_{\theta}$ is the policy, $a_t$ the action and $s_t$ the state at step $t$
        log_prob = pd.log_prob(action)
        self.log_probs.append(log_prob)  # store for training
        return action.item()  # extract the scalar value of the single-element tensor
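
As a quick sanity check (a hypothetical snippet, not part of the original post; it assumes the imports and the Pi class above), the network can be exercised on a dummy state:

pi = Pi(in_dim=4, out_dim=2)
dummy_state = np.zeros(4, dtype=np.float32)  # a made-up state, for illustration only
action = pi.act(dummy_state)                 # samples 0 or 1 from the categorical policy
print(action, pi.log_probs)                  # the sampled action and its stored log-probability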

Backpropagation training of the policy network:

def train(pi, optimizer):
    # Inner gradient-ascent loop of the REINFORCE algorithm:
    # estimate the loss by Monte-Carlo sampling and update the policy parameters.
    # Strictly, the Monte-Carlo estimate should average over many sampled trajectories;
    # for simplicity a single trajectory is used as the estimate.
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # initialize returns
    future_ret = 0.0
    # compute the returns efficiently in reverse order:
    # R_t(\tau) = \Sigma_{t'=t}^{T} {\gamma^{t'-t} r_{t'}}
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    baseline = sum(rets) / T
    rets = torch.tensor(rets)
    rets = rets - baseline  # modify the returns by subtracting the constant baseline
    log_probs = torch.stack(pi.log_probs)
    # - R_t(\tau) * log(\pi_{\theta}(a_t|s_t)); negative sign because the optimizer minimizes
    loss = - log_probs * rets
    loss = torch.sum(loss)  # - \Sigma_{t=0}^{T} [R_t(\tau) * log(\pi_{\theta}(a_t|s_t))]
    optimizer.zero_grad()
    # backpropagate: compute the gradients of the loss w.r.t. the model parameters (\theta)
    loss.backward()
    # gradient-ascent step: update the weights of the policy network
    optimizer.step()
    return loss
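
To see what the reverse-order loop computes, here is a small hand-worked example (made-up rewards, gamma = 0.99, not from the original post) of the discounted returns R_t(τ) and the per-trajectory baseline:

import numpy as np

gamma = 0.99
rewards = [1.0, 1.0, 1.0]        # a made-up 3-step episode
rets, future_ret = [], 0.0
for r in reversed(rewards):
    future_ret = r + gamma * future_ret
    rets.insert(0, future_ret)
print(rets)                      # [2.9701, 1.99, 1.0]
print(np.mean(rets))             # the constant baseline b ≈ 1.9867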

Multi-episode training: the whole REINFORCE training loop ends once the pole has been balanced successfully for enough consecutive episodes.

def train_main():
    env = gym.make('CartPole-v1', render_mode="human")
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n             # 2
    pi = Pi(in_dim, out_dim)  # an instance of the policy network for the REINFORCE algorithm
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    episode = 0
    continuous_solved_episode = 0
    # for epi in range(300):  # alternatively, a fixed number of episodes
    while continuous_solved_episode <= 14:
        # state = env.reset()  # gym
        state, _ = env.reset()  # gymnasium
        for t in range(200):  # cap each training episode at 200 time steps
            action = pi.act(state)
            # state, reward, done, _ = env.step(action)  # gym
            state, reward, done, _, _ = env.step(action)  # gymnasium
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        episode += 1
        if solved:
            continuous_solved_episode += 1
        else:
            continuous_solved_episode = 0
        print(f'Episode {episode}, loss: {loss}, '
              f'total_reward: {total_reward}, solved: {solved}, '
              f'continuous_solved: {continuous_solved_episode}')
        pi.onpolicy_reset()  # on-policy: clear memory after training
    save_model(pi)
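
One caveat: Gymnasium's env.step() returns terminated and truncated separately, and the loop above only treats the third value (terminated) as done. If you also want to stop on truncation, a small drop-in variant of the step call inside the loop (an optional change, not the post's behavior) would be:

state, reward, terminated, truncated, _ = env.step(action)
done = terminated or truncated   # end the episode on either condition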

A simple screen recording of training:

REINFORCE_training

Testing is done with the network in evaluation mode; during testing the pole can be balanced over a longer horizon.

def test_process():
    env = gym.make('CartPole-v1', render_mode="human")
    # in_dim = env.observation_space.shape[0]  # 4
    # out_dim = env.action_space.n             # 2
    # pi_model = Pi(in_dim, out_dim)
    pi_model = torch.load(model_path)
    pi_model.eval()  # set the model to evaluation mode
    with torch.no_grad():  # forward passes only, no gradient tracking
        pi_model.onpolicy_reset()  # on-policy: clear memory
        state, _ = env.reset()  # gymnasium
        steps = 600
        for t in range(steps):  # run longer than the 200-step training cap
            action = pi_model.act(state)
            state, reward, done, _, _ = env.step(action)
            pi_model.rewards.append(reward)
            env.render()
            if done:
                break
        total_reward = sum(pi_model.rewards)
        solved = total_reward >= steps
        print(f'[Test] total_reward: {total_reward}, solved: {solved}')
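
Since save_model() serializes the whole module with torch.save(pi, model_path), torch.load() needs the Pi class definition importable at load time (and on recent PyTorch versions it may require torch.load(model_path, weights_only=False)). A common alternative, sketched here as an assumption rather than what the post does, is to save and restore only the state dict:

# saving: weights only
torch.save(pi.state_dict(), model_path)

# loading: rebuild the network, then restore the weights
pi_model = Pi(in_dim=4, out_dim=2)
pi_model.load_state_dict(torch.load(model_path))
pi_model.eval()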

A simple screen recording of testing:

REINFORCE_testing

Complete code:

import gymnasium as gym
# import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import sys

gamma = 0.99  # discount factor
model_path = "./reinforce_pi.pt"


class Pi(nn.Module):
    # A policy network to be optimized by reinforcement learning
    def __init__(self, in_dim, out_dim):  # in_dim = 4, out_dim = 2
        # super(Pi, self).__init__()
        super().__init__()
        layers = [
            nn.Linear(in_dim, 64),   # 4 -> 64
            nn.ReLU(),               # activation function
            nn.Linear(64, out_dim),  # 64 -> 2
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()  # initialize memory
        self.train()           # set the model to training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):  # x -> state
        pdparam = self.model(x)  # forward pass
        # pdparam -> probability distribution parameters (logits of a categorical distribution)
        return pdparam

    def act(self, state):
        # Convert the state from a NumPy array to a PyTorch tensor
        x = torch.from_numpy(state.astype(np.float32))
        # print("state: {}".format(state))
        pdparam = self.forward(x)  # forward pass to obtain the distribution parameters
        # print("pdparam: {}".format(pdparam))
        pd = torch.distributions.Categorical(logits=pdparam)  # probability distribution pi(.|s)
        # print("pd.probs: {}\t pd.logits: {}".format(pd.probs, pd.logits))
        action = pd.sample()            # sample an action a ~ pi(a|s)
        log_prob = pd.log_prob(action)  # log(\pi_{\theta}(a_t|s_t)) of the sampled action
        self.log_probs.append(log_prob) # store for training
        return action.item()            # extract the scalar value of the single-element tensor


def train(pi, optimizer):
    # Inner gradient-ascent loop of the REINFORCE algorithm.
    # Strictly, the Monte-Carlo estimate should average over many trajectories;
    # for simplicity a single trajectory is used as the estimate.
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # initialize returns
    future_ret = 0.0
    # compute the returns in reverse order: R_t(\tau) = \Sigma_{t'=t}^{T} {\gamma^{t'-t} r_{t'}}
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    baseline = sum(rets) / T
    rets = torch.tensor(rets)
    rets = rets - baseline  # subtract the constant baseline from the returns
    log_probs = torch.stack(pi.log_probs)
    loss = - log_probs * rets   # - R_t(\tau) * log(\pi_{\theta}(a_t|s_t)); negative for maximizing
    loss = torch.sum(loss)      # - \Sigma_{t=0}^{T} [R_t(\tau) * log(\pi_{\theta}(a_t|s_t))]
    optimizer.zero_grad()
    loss.backward()    # backpropagate: compute gradients of the loss w.r.t. \theta
    optimizer.step()   # gradient step: update the weights of the policy network
    return loss


def save_model(pi):
    print("pi.state_dict(): {}\n\n".format(pi.state_dict()))
    for param_tensor in pi.state_dict():
        print(param_tensor, "\t", pi.state_dict()[param_tensor].size())
    torch.save(pi, model_path)


def train_main():
    env = gym.make('CartPole-v1', render_mode="human")
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n             # 2
    pi = Pi(in_dim, out_dim)  # an instance of the policy network for the REINFORCE algorithm
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    episode = 0
    continuous_solved_episode = 0
    # for epi in range(300):  # alternatively, a fixed number of episodes
    while continuous_solved_episode <= 14:
        # state = env.reset()  # gym
        state, _ = env.reset()  # gymnasium
        for t in range(200):  # cap each training episode at 200 time steps
            action = pi.act(state)
            # state, reward, done, _ = env.step(action)  # gym
            state, reward, done, _, _ = env.step(action)  # gymnasium
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        episode += 1
        if solved:
            continuous_solved_episode += 1
        else:
            continuous_solved_episode = 0
        print(f'Episode {episode}, loss: {loss}, '
              f'total_reward: {total_reward}, solved: {solved}, '
              f'continuous_solved: {continuous_solved_episode}')
        pi.onpolicy_reset()  # on-policy: clear memory after training
    save_model(pi)


def usage():
    if len(sys.argv) != 2:
        print("Usage: python ./REINFORCE.py --train/--test")
        sys.exit()
    mode = sys.argv[1]
    return mode


def test_process():
    env = gym.make('CartPole-v1', render_mode="human")
    # in_dim = env.observation_space.shape[0]  # 4
    # out_dim = env.action_space.n             # 2
    # pi_model = Pi(in_dim, out_dim)
    pi_model = torch.load(model_path)
    pi_model.eval()  # set the model to evaluation mode
    with torch.no_grad():  # forward passes only, no gradient tracking
        pi_model.onpolicy_reset()  # on-policy: clear memory
        state, _ = env.reset()  # gymnasium
        steps = 600
        for t in range(steps):  # run longer than the 200-step training cap
            action = pi_model.act(state)
            state, reward, done, _, _ = env.step(action)
            pi_model.rewards.append(reward)
            env.render()
            if done:
                break
        total_reward = sum(pi_model.rewards)
        solved = total_reward >= steps
        print(f'[Test] total_reward: {total_reward}, solved: {solved}')


if __name__ == '__main__':
    mode = usage()
    if mode == "--train":
        train_main()
    elif mode == "--test":
        test_process()
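
Assuming the script is saved as REINFORCE.py, training is started with python ./REINFORCE.py --train and testing with python ./REINFORCE.py --test, matching the usage() helper above.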

Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY license. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/woyaomaishu2/article/details/146382384
Author: wzf@robotics_notes
