The Gymnasium Cart Pole Environment and the REINFORCE Algorithm: Introduction to Reinforcement Learning, Part 2



Contents

  • I. The Gymnasium Cart Pole Environment
  • II. The REINFORCE Algorithm
    • 1. Principle
    • 2. Implementing REINFORCE


I. The Gymnasium Cart Pole Environment

The Gymnasium Cart Pole environment is a dynamics simulation of an inverted pendulum: a pole is mounted on a cart, and the cart is pushed left or right to keep the pole upright.

State space:

0: Cart Position

1: Cart Velocity

2: Pole Angle

3: Pole Angular Velocity

Action space:

0: Push cart to the left

1: Push cart to the right

Reward:

To encourage keeping the pendulum upright for as long as possible, a reward of +1 is received at every time step.

Episode end criteria:

Termination: Pole Angle is greater than ±12°

Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)

Truncation: Episode length exceeds the time limit (500 steps for CartPole-v1; the training code below additionally caps each episode at 200 steps)
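Before moving on to REINFORCE, a minimal random-policy rollout (my own sketch, not part of the original article) shows the observation/action spaces and the termination/truncation signals described above:

import gymnasium as gym

env = gym.make('CartPole-v1')
print(env.observation_space)  # Box with 4 values: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): 0 = push cart to the left, 1 = push cart to the right

state, _ = env.reset(seed=0)
total_reward = 0.0
while True:
    action = env.action_space.sample()  # random action
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward              # +1 per time step survived
    if terminated or truncated:
        break
print(f'random policy survived {int(total_reward)} steps')
env.close()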


II. The REINFORCE Algorithm

1. Principle

For the REINFORCE algorithm and its Python implementation we follow Foundations of Deep Reinforcement Learning: Theory and Practice in Python.
Note that we use the improved REINFORCE estimator, which subtracts a baseline from the returns:

$$\nabla_{\theta} J(\pi_\theta) \approx \sum_{t=0}^{T} \left(R_t(\tau)-b\right) \nabla_{\theta}\log\pi_\theta(a_t|s_t)$$

where $b$ is the mean return over the whole trajectory, i.e. a constant baseline per trajectory:

$$b=\frac{1}{T} \sum_{t=0}^{T} R_t(\tau)$$
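To make the baseline-subtracted return concrete, here is a small numerical sketch (a hypothetical 4-step reward sequence, not from the article) that computes R_t(tau), b, and R_t(tau) - b exactly the way the train() function below does:

import numpy as np

gamma = 0.99
rewards = [1.0, 1.0, 1.0, 1.0]           # hypothetical 4-step episode, +1 per step
T = len(rewards)
rets = np.empty(T, dtype=np.float32)
future_ret = 0.0
for t in reversed(range(T)):             # R_t = r_t + gamma * R_{t+1}
    future_ret = rewards[t] + gamma * future_ret
    rets[t] = future_ret
baseline = rets.mean()                   # b = (1/T) * sum_t R_t(tau)
advantages = rets - baseline             # the (R_t(tau) - b) terms that weight the log-probabilities
print(rets)                              # approx. [3.94, 2.97, 1.99, 1.00]
print(baseline, advantages)              # approx. 2.48, [1.47, 0.50, -0.49, -1.48]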
In addition, REINFORCE training stops once the pendulum has been controlled successfully in 15 consecutive episodes, and the policy network is then saved.

At test time, the saved policy network is loaded and, even over a longer test horizon, it still balances the pendulum well.


2. Implementing REINFORCE

The policy network of the REINFORCE algorithm:

class Pi(nn.Module):
    # the policy network to be optimized in reinforcement learning
    def __init__(self, in_dim, out_dim):  # in_dim = 4, out_dim = 2
        # super(Pi, self).__init__()
        super().__init__()
        # a policy network
        layers = [
            nn.Linear(in_dim, 64),   # 4 -> 64
            nn.ReLU(),               # activation function
            nn.Linear(64, out_dim),  # 64 -> 2
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()  # initialize memory
        self.train()           # set the model to training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):  # x -> state
        pdparam = self.model(x)  # forward pass
        return pdparam
        # pdparam -> probability distribution parameters,
        # such as the logits of a categorical distribution

    def act(self, state):
        # sample an action from the policy and record its log-probability
        # convert the state from a NumPy array to a PyTorch tensor
        x = torch.from_numpy(state.astype(np.float32))
        # print("state: {}".format(state))
        pdparam = self.forward(x)  # forward pass through the network to get the distribution parameters
        # print("pdparam: {}".format(pdparam))
        pd = torch.distributions.Categorical(logits=pdparam)  # probability distribution
        # print("pd.probs: {}\t pd.logits: {}".format(pd.probs, pd.logits))
        action = pd.sample()  # pi(a|s): sample an action from pd
        # log probability of the sampled action under pd:
        # log(pi_theta(a_t|s_t)), where pi_theta is the policy network,
        # a_t the action and s_t the state at time step t
        log_prob = pd.log_prob(action)  # log_prob of pi(a|s)
        self.log_probs.append(log_prob)  # store for training
        return action.item()  # extract the value of a single-element tensor as a scalar
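As a quick sanity check (my own sketch, assuming the imports from the complete listing below, i.e. numpy as np and torch), the policy can be instantiated for Cart Pole's 4-dimensional state and 2 actions and asked for one action:

pi = Pi(in_dim=4, out_dim=2)
dummy_state = np.zeros(4, dtype=np.float32)  # an arbitrary all-zero state, just for illustration
action = pi.act(dummy_state)                 # 0 or 1, sampled from the categorical policy
print(action, pi.log_probs[-1])              # the log-probability has been stored for training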

Backpropagation training of the policy network:

def train(pi, optimizer):
    # Monte Carlo estimate of the policy-gradient loss; strictly the expectation should be
    # averaged over many sampled trajectories, but for simplicity one trajectory is used per update.
    # Inner gradient-ascent loop of the REINFORCE algorithm.
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # initialize the returns
    future_ret = 0.0
    # compute the returns efficiently in reverse order:
    # R_t(tau) = sum_{t'=t}^{T} gamma^{t'-t} * r_{t'}
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    baseline = sum(rets) / T
    rets = torch.tensor(rets)
    rets = rets - baseline  # modify the returns by subtracting a baseline
    log_probs = torch.stack(pi.log_probs)
    # - R_t(tau) * log(pi_theta(a_t|s_t)); negative sign so that minimizing maximizes the objective
    loss = - log_probs * rets
    loss = torch.sum(loss)  # - sum_{t=0}^{T} [R_t(tau) * log(pi_theta(a_t|s_t))]
    optimizer.zero_grad()
    loss.backward()   # backpropagate: compute gradients of the loss w.r.t. the parameters theta
    optimizer.step()  # gradient ascent: update the weights of the policy network
    return loss

Multi-episode training: once the pendulum has been controlled successfully for enough consecutive episodes, the whole REINFORCE training run ends.

def train_main():
    env = gym.make('CartPole-v1', render_mode="human")
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n             # 2
    pi = Pi(in_dim, out_dim)  # an instance of the policy network for the REINFORCE algorithm
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    episode = 0
    continuous_solved_episode = 0
    # for epi in range(300):  # episode = 300
    while continuous_solved_episode <= 14:
        # state = env.reset()  # gym
        state, _ = env.reset()  # gymnasium
        for t in range(200):  # cap each episode at 200 time steps
            action = pi.act(state)
            # state, reward, done, _ = env.step(action)  # gym
            state, reward, done, _, _ = env.step(action)  # gymnasium
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        episode += 1
        if solved:
            continuous_solved_episode += 1
        else:
            continuous_solved_episode = 0
        print(f'Episode {episode}, loss: {loss}, '
              f'total_reward: {total_reward}, solved: {solved}, continuous_solved: {continuous_solved_episode}')
        pi.onpolicy_reset()  # on-policy: clear memory after training
    save_model(pi)

A short screen recording of training:

REINFORCE_training

Testing is performed with the network in evaluation mode (model.eval(), with gradient tracking disabled via torch.no_grad()); during testing the pendulum can be controlled over a longer horizon.

def test_process():
    env = gym.make('CartPole-v1', render_mode="human")
    # in_dim = env.observation_space.shape[0]  # 4
    # out_dim = env.action_space.n             # 2
    # pi_model = Pi(in_dim, out_dim)
    pi_model = torch.load(model_path)
    # set the model to evaluation mode
    pi_model.eval()
    # run the policy without tracking gradients
    with torch.no_grad():
        pi_model.onpolicy_reset()  # on-policy: clear memory before testing
        state, _ = env.reset()     # gymnasium
        steps = 600
        for t in range(steps):  # longer test horizon than the 200-step cap used in training
            action = pi_model.act(state)
            state, reward, done, _, _ = env.step(action)
            pi_model.rewards.append(reward)
            env.render()
            if done:
                break
        total_reward = sum(pi_model.rewards)
        solved = total_reward >= steps
        print(f'[Test] total_reward: {total_reward}, solved: {solved}')

A short screen recording of testing:

REINFORCE_testing

Complete code:

import gymnasium as gym
# import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import sys

gamma = 0.99  # discount factor
model_path = "./reinforce_pi.pt"


class Pi(nn.Module):
    # the policy network to be optimized in reinforcement learning
    def __init__(self, in_dim, out_dim):  # in_dim = 4, out_dim = 2
        # super(Pi, self).__init__()
        super().__init__()
        # a policy network
        layers = [
            nn.Linear(in_dim, 64),   # 4 -> 64
            nn.ReLU(),               # activation function
            nn.Linear(64, out_dim),  # 64 -> 2
        ]
        self.model = nn.Sequential(*layers)
        self.onpolicy_reset()  # initialize memory
        self.train()           # set the model to training mode

    def onpolicy_reset(self):
        self.log_probs = []
        self.rewards = []

    def forward(self, x):  # x -> state
        pdparam = self.model(x)  # forward pass
        return pdparam
        # pdparam -> probability distribution parameters,
        # such as the logits of a categorical distribution

    def act(self, state):
        # sample an action from the policy and record its log-probability
        # convert the state from a NumPy array to a PyTorch tensor
        x = torch.from_numpy(state.astype(np.float32))
        # print("state: {}".format(state))
        pdparam = self.forward(x)  # forward pass through the network to get the distribution parameters
        # print("pdparam: {}".format(pdparam))
        pd = torch.distributions.Categorical(logits=pdparam)  # probability distribution
        # print("pd.probs: {}\t pd.logits: {}".format(pd.probs, pd.logits))
        action = pd.sample()  # pi(a|s): sample an action from pd
        # log probability of the sampled action under pd:
        # log(pi_theta(a_t|s_t)), where pi_theta is the policy network,
        # a_t the action and s_t the state at time step t
        log_prob = pd.log_prob(action)  # log_prob of pi(a|s)
        self.log_probs.append(log_prob)  # store for training
        return action.item()  # extract the value of a single-element tensor as a scalar


def train(pi, optimizer):
    # Monte Carlo estimate of the policy-gradient loss; strictly the expectation should be
    # averaged over many sampled trajectories, but for simplicity one trajectory is used per update.
    # Inner gradient-ascent loop of the REINFORCE algorithm.
    T = len(pi.rewards)
    rets = np.empty(T, dtype=np.float32)  # initialize the returns
    future_ret = 0.0
    # compute the returns efficiently in reverse order:
    # R_t(tau) = sum_{t'=t}^{T} gamma^{t'-t} * r_{t'}
    for t in reversed(range(T)):
        future_ret = pi.rewards[t] + gamma * future_ret
        rets[t] = future_ret
    baseline = sum(rets) / T
    rets = torch.tensor(rets)
    rets = rets - baseline  # modify the returns by subtracting a baseline
    log_probs = torch.stack(pi.log_probs)
    # - R_t(tau) * log(pi_theta(a_t|s_t)); negative sign so that minimizing maximizes the objective
    loss = - log_probs * rets
    loss = torch.sum(loss)  # - sum_{t=0}^{T} [R_t(tau) * log(pi_theta(a_t|s_t))]
    optimizer.zero_grad()
    loss.backward()   # backpropagate: compute gradients of the loss w.r.t. the parameters theta
    optimizer.step()  # gradient ascent: update the weights of the policy network
    return loss


def save_model(pi):
    print("pi.state_dict(): {}\n\n".format(pi.state_dict()))
    for param_tensor in pi.state_dict():
        print(param_tensor, "\t", pi.state_dict()[param_tensor].size())
    torch.save(pi, model_path)


def train_main():
    env = gym.make('CartPole-v1', render_mode="human")
    in_dim = env.observation_space.shape[0]  # 4
    out_dim = env.action_space.n             # 2
    pi = Pi(in_dim, out_dim)  # an instance of the policy network for the REINFORCE algorithm
    optimizer = optim.Adam(pi.parameters(), lr=0.01)
    episode = 0
    continuous_solved_episode = 0
    # for epi in range(300):  # episode = 300
    while continuous_solved_episode <= 14:
        # state = env.reset()  # gym
        state, _ = env.reset()  # gymnasium
        for t in range(200):  # cap each episode at 200 time steps
            action = pi.act(state)
            # state, reward, done, _ = env.step(action)  # gym
            state, reward, done, _, _ = env.step(action)  # gymnasium
            pi.rewards.append(reward)
            env.render()
            if done:
                break
        loss = train(pi, optimizer)  # train per episode
        total_reward = sum(pi.rewards)
        solved = total_reward > 195.0
        episode += 1
        if solved:
            continuous_solved_episode += 1
        else:
            continuous_solved_episode = 0
        print(f'Episode {episode}, loss: {loss}, '
              f'total_reward: {total_reward}, solved: {solved}, continuous_solved: {continuous_solved_episode}')
        pi.onpolicy_reset()  # on-policy: clear memory after training
    save_model(pi)


def usage():
    if len(sys.argv) != 2:
        print("Usage: python ./REINFORCE.py --train/--test")
        sys.exit()
    mode = sys.argv[1]
    return mode


def test_process():
    env = gym.make('CartPole-v1', render_mode="human")
    # in_dim = env.observation_space.shape[0]  # 4
    # out_dim = env.action_space.n             # 2
    # pi_model = Pi(in_dim, out_dim)
    pi_model = torch.load(model_path)
    # set the model to evaluation mode
    pi_model.eval()
    # run the policy without tracking gradients
    with torch.no_grad():
        pi_model.onpolicy_reset()  # on-policy: clear memory before testing
        state, _ = env.reset()     # gymnasium
        steps = 600
        for t in range(steps):  # longer test horizon than the 200-step cap used in training
            action = pi_model.act(state)
            state, reward, done, _, _ = env.step(action)
            pi_model.rewards.append(reward)
            env.render()
            if done:
                break
        total_reward = sum(pi_model.rewards)
        solved = total_reward >= steps
        print(f'[Test] total_reward: {total_reward}, solved: {solved}')


if __name__ == '__main__':
    mode = usage()
    if mode == "--train":
        train_main()
    elif mode == "--test":
        test_process()
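Assuming the listing above is saved as REINFORCE.py (the file name used in the usage() message), training is started with python ./REINFORCE.py --train and the saved policy is evaluated with python ./REINFORCE.py --test.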

Copyright notice: this is an original article by the author, released under the CC 4.0 BY license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/woyaomaishu2/article/details/146382384
Author: wzf@robotics_notes
