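[Deep Q-Learning 01] Q-Learning Concepts and a Python Implementation

Table of Contents

1. Overview
2. Deep Q-Learning Concepts
3. Python Implementation
4. Conclusion

Keywords: Deep Q-Networks

1. Overview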

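In reinforcement learning (RL), Q-learning is a foundational algorithm that helps an agent navigate its environment by learning a policy that maximizes cumulative reward. It does this by updating an action-value function, which estimates the expected utility of taking a particular action in a given state from the reward received and the current estimate of future value.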
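The update described above can be sketched in a few lines. This is a minimal illustration rather than the code used later in this article: the function name `q_update`, the learning rate `alpha`, and the toy table size are assumptions chosen for the example.

```python
import numpy as np

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update, applied in place (illustrative sketch)."""
    # Reward received plus the discounted best estimate from the next state
    td_target = reward + gamma * np.max(Q[next_state])
    # Move the current estimate a fraction alpha toward that target
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q[state, action]

# Toy table: 3 states x 2 actions, all estimates start at zero
Q = np.zeros((3, 2))
new_value = q_update(Q, state=0, action=1, reward=1.0, next_state=2)
print(new_value)
```

With `alpha = 1` the old estimate is simply overwritten by the target, which appears to be the form the implementation in section 3 uses.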

2. Deep Q-Learning Concepts

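Q-Learning is a prerequisite here, because the Q-Learning process builds an exact matrix for the agent, which it can "consult" to maximize its reward in the long run. There is nothing wrong with this approach in itself, but it is only practical for very small environments and quickly becomes infeasible as the number of states and actions grows.

The solution comes from recognizing that the values in the matrix matter only relative to one another. This idea leads to Deep Q-Learning, which uses a deep neural network to approximate those values. The approximation does no harm as long as the relative ordering of the values is preserved. The basic working step of Deep Q-Learning is to feed the state into the neural network, which returns the Q-values of all possible actions as its output.

The difference between the two can be summarized as follows: Q-Learning looks up the Q-value of a state-action pair in a table, whereas Deep Q-Learning passes the state through a network that outputs one Q-value per action.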
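The contrast between a lookup table and a network can be made concrete with a small NumPy sketch. Everything here is an assumption made for illustration: the one-hot state encoding, the single hidden layer of 16 units, and the random (untrained) weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, hidden = 11, 11, 16

# Tabular Q-learning: one stored number per (state, action) pair
q_table = np.zeros((n_states, n_actions))

# Deep Q-learning (sketch): a small network maps a state encoding to
# one Q-value per action. The weights are random here, i.e. untrained.
W1 = rng.normal(scale=0.1, size=(n_states, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, n_actions))

def one_hot(state):
    x = np.zeros(n_states)
    x[state] = 1.0
    return x

def q_network(state):
    h = np.maximum(0.0, one_hot(state) @ W1)  # ReLU hidden layer
    return h @ W2                             # Q-values for all actions

q_values = q_network(3)
print(q_values.shape)             # one output per action
print(int(np.argmax(q_values)))   # greedy action under the untrained network
```

Both `q_table[3]` and `q_network(3)` give a vector with one entry per action; the network produces it from shared weights instead of storing every entry separately.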

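Pseudocode:

$$\text{target} = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$$

Observe that in this equation the term $\gamma \max_{a'} Q_k(s', a')$ is itself variable. Consequently, the target the neural network is trained toward keeps moving during this process, unlike in typical deep learning setups where the target is fixed. This problem can be addressed by using two neural networks instead of one. One network is used to adjust the parameters; the other, which has the same architecture as the first but frozen parameters, is used to compute the target. After every x iterations of the main network, its parameters are copied to the target network.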
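The two-network idea can be sketched as follows. This is a simplified illustration, not a full DQN: a linear map stands in for the deep network, random noise stands in for the gradient step, and `sync_every` (the "x iterations" above), `gamma`, and all function names are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
gamma = 0.9        # discount factor (assumed value)
sync_every = 100   # the "x iterations" from the text (assumed value)

# Main network parameters (trained) and a frozen copy used for targets
main_W = rng.normal(size=(n_features, n_actions))
target_W = main_W.copy()

def q_values(W, state):
    return state @ W   # linear stand-in for a deep network

def td_target(frozen_W, reward, next_state, done):
    # The target is computed with the frozen parameters only
    if done:
        return reward
    return reward + gamma * np.max(q_values(frozen_W, next_state))

for step in range(1, 301):
    # Stand-in for a gradient step on the main network
    main_W += 0.01 * rng.normal(size=main_W.shape)
    if step % sync_every == 0:
        target_W = main_W.copy()   # copy parameters to the target network

example_target = td_target(target_W, 1.0, np.zeros(n_features), done=False)
print(example_target)
```

Between synchronizations the targets stay fixed while the main network changes, which is what keeps the regression target from shifting on every step.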


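Deep Q-learning is a reinforcement learning algorithm that uses a deep neural network to approximate the Q-function, which is used to determine the best action to take in a given state. The Q-function represents the expected cumulative reward of taking a particular action in a particular state and following a particular policy thereafter. In Q-learning, the Q-function is updated iteratively as the agent interacts with the environment. Deep Q-learning is used in applications such as games, robotics, and autonomous driving.

Deep Q-learning is a variant of Q-learning that represents the Q-function with a deep neural network rather than a simple table of values. This lets the algorithm handle environments with large numbers of states and actions, and learn from high-dimensional inputs such as images or sensor data.

One of the key challenges in implementing deep Q-learning is that the Q-function is usually nonlinear and may have many local minima, which can make it hard for the neural network to converge to the correct Q-function. Several techniques have been proposed to address this, such as experience replay and target networks.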

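Experience replay is a technique in which the agent stores a subset of its experiences (state, action, reward, next state) in a memory buffer and samples from that buffer to update the Q-function. This helps decorrelate the data and makes learning more stable. A target network, on the other hand, is used to stabilize the Q-function updates: a separate network computes the target Q-values, which are then used to update the Q-function network.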
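A replay buffer of the kind described above can be sketched with the standard library alone. The class name `ReplayBuffer`, the capacity, and the dummy transitions are assumptions for illustration; real implementations usually also store a terminal flag.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # Once the buffer is full, the oldest transitions are dropped
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up consecutive steps
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

buffer = ReplayBuffer(capacity=5)
for t in range(8):
    buffer.push(t, 0, 1.0, t + 1)   # dummy transitions

batch = buffer.sample(3)
print(len(buffer), len(batch))      # 5 3
```

Because only the five most recent transitions survive, the buffer holds a subset of the agent's experience, and each sampled batch mixes transitions from different time steps.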

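Deep Q-learning and closely related deep reinforcement learning methods have been applied to a range of problems, including games, robotics, and autonomous driving. For example, such methods have been used to train agents that play games like Atari and Go, and to control robots in tasks such as grasping and navigation.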

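3. Python Implementation

Reinforcement learning is a machine learning paradigm in which the learning algorithm is trained not on a preset dataset but through a feedback system. These algorithms are often described as promising because they reduce the cost of collecting and cleaning data up front.

In this article we demonstrate how to implement a basic reinforcement learning algorithm known as Q-Learning. In this demonstration, we try to teach a robot to reach its destination using Q-Learning.

Step 1: Import the required libraries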

import numpy as np 
import pylab as pl 
import networkx as nx

Step 2: Define and visualize the graph

edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (1, 3), (9, 10),
         (2, 4), (0, 6), (6, 7), (8, 9), (7, 8), (1, 7), (3, 9)]
goal = 10
G = nx.Graph() 
G.add_edges_from(edges) 
pos = nx.spring_layout(G) 
nx.draw_networkx_nodes(G, pos) 
nx.draw_networkx_edges(G, pos) 
nx.draw_networkx_labels(G, pos) 
pl.show()

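Note: the plot produced by this code may look different each time it is run. The graph itself is always the same; only the node positions change, because networkx's spring_layout starts from random positions.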
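If a repeatable drawing is preferred, one option is to pass a seed to `nx.spring_layout`. The sketch below rebuilds the same graph and checks that two seeded layouts agree; the seed value 42 is arbitrary and chosen only for the example.

```python
import networkx as nx

edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (1, 3), (9, 10),
         (2, 4), (0, 6), (6, 7), (8, 9), (7, 8), (1, 7), (3, 9)]
G = nx.Graph()
G.add_edges_from(edges)

# Fixing the seed makes the layout repeatable across runs
pos_a = nx.spring_layout(G, seed=42)
pos_b = nx.spring_layout(G, seed=42)

same = all((pos_a[n] == pos_b[n]).all() for n in G.nodes)
print(same)
```

The resulting `pos_a` can be passed to the `nx.draw_networkx_*` calls in place of the unseeded `pos`.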

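Step 3: Define the reward system for the robot

MATRIX_SIZE = 11
# -1 marks pairs of nodes that are not connected
M = np.matrix(np.ones(shape=(MATRIX_SIZE, MATRIX_SIZE)))
M *= -1

for point in edges:
    print(point)
    # Moving along an edge into the goal earns 100, any other edge earns 0
    if point[1] == goal:
        M[point] = 100
    else:
        M[point] = 0

    # Same rule for the reverse direction of the edge
    if point[0] == goal:
        M[point[::-1]] = 100
    else:
        M[point[::-1]] = 0

# add goal point round trip
M[goal, goal] = 100
print(M)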
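The same reward rule can also be written as a small function over a plain NumPy array, which makes the three possible values (-1, 0, 100) easy to check. The function name `build_reward_matrix` and the use of `np.full` are choices made for this sketch, not part of the original code.

```python
import numpy as np

def build_reward_matrix(edges, goal, size):
    """Reward matrix: -1 for no edge, 0 for an edge, 100 for a move into the goal."""
    R = np.full((size, size), -1.0)
    for a, b in edges:
        R[a, b] = 100.0 if b == goal else 0.0
        R[b, a] = 100.0 if a == goal else 0.0
    R[goal, goal] = 100.0   # staying at the goal is also rewarded
    return R

edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (1, 3), (9, 10),
         (2, 4), (0, 6), (6, 7), (8, 9), (7, 8), (1, 7), (3, 9)]
R = build_reward_matrix(edges, goal=10, size=11)

print(R[9, 10])   # 100.0: edge leading into the goal
print(R[10, 9])   # 0.0: edge leading away from the goal
print(R[0, 2])    # -1.0: nodes 0 and 2 are not connected
```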


Step 4: Define some utility functions used in training

Q = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))

# discount factor
gamma = 0.75
initial_state = 1

# Determines the available actions for a given state
def available_actions(state):
    current_state_row = M[state, ]
    available_action = np.where(current_state_row >= 0)[1]
    return available_action

available_action = available_actions(initial_state)

# Chooses one of the available actions at random
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range))
    return next_action

action = sample_next_action(available_action)

# Updates the Q-Matrix according to the path chosen
def update(current_state, action, gamma):
    # Best known follow-up move from the node we are moving to (ties broken at random)
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    max_value = Q[action, max_index]
    Q[current_state, action] = M[current_state, action] + gamma * max_value

    # Score: sum of the Q-matrix normalized so that its largest entry is 100
    if np.max(Q) > 0:
        return np.sum(Q / np.max(Q) * 100)
    else:
        return 0

update(initial_state, action, gamma)
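To see how the update rule spreads the goal reward backwards through the graph, it helps to work through the first few updates by hand. The sketch below assumes the moves 9 to 10, 3 to 9, and 1 to 3 are updated in that order and that every other Q entry is still zero, which will not generally hold once training has run for a while.

```python
gamma = 0.75

# Move 9 -> 10: immediate reward 100, and Q[10, :] is still all zeros
q_9_10 = 100 + gamma * 0

# Move 3 -> 9 afterwards: reward 0, best value out of node 9 is now 100
q_3_9 = 0 + gamma * q_9_10

# Move 1 -> 3 afterwards: reward 0, best value out of node 3 is now 75
q_1_3 = 0 + gamma * q_3_9

print(q_9_10, q_3_9, q_1_3)   # 100.0 75.0 56.25
```

Each step away from the goal multiplies the value by `gamma`, so moves that lead toward the goal more directly end up with larger Q-values.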

Step 5: Train and evaluate the robot using the Q-matrix

scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)
    scores.append(score)

# print("Trained Q matrix:")
# print(Q / np.max(Q)*100)
# You can uncomment the above two lines to view the trained Q matrix

# Testing: follow the highest Q-value from node 0 until the goal is reached
current_state = 0
steps = [current_state]

while current_state != goal:
    next_step_index = np.where(Q[current_state, ] == np.max(Q[current_state, ]))[1]
    if next_step_index.shape[0] > 1:
        next_step_index = int(np.random.choice(next_step_index))
    else:
        next_step_index = int(next_step_index[0])
    steps.append(next_step_index)
    current_state = next_step_index

print("Most efficient path:")
print(steps)

pl.plot(scores)
pl.xlabel('No of iterations')
pl.ylabel('Reward gained')
pl.show()
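One way to sanity-check the path printed by the testing loop is to compare it with a shortest path computed directly on the graph. The sketch below uses `nx.shortest_path`, which performs a breadth-first search on unweighted graphs; the variable name `reference` is an assumption. Training is random, so the learned path is not guaranteed to match on every run.

```python
import networkx as nx

edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (1, 3), (9, 10),
         (2, 4), (0, 6), (6, 7), (8, 9), (7, 8), (1, 7), (3, 9)]
G = nx.Graph()
G.add_edges_from(edges)

# Shortest route from the start node 0 to the goal node 10
reference = nx.shortest_path(G, source=0, target=10)
print(reference)            # [0, 1, 3, 9, 10]
print(len(reference) - 1)   # 4 moves
```

If `steps` from the testing loop is longer than `reference`, the Q-matrix has probably not been trained for enough iterations yet.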


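Now let's take this robot into a more realistic setting. Imagine the robot is a detective trying to find the location of a large drug racket. He naturally concludes that the dealers will not sell their products in places known to be frequented by the police, and that the selling spots lie close to the racket's location. In addition, sellers leave traces of their product where they sell it, which can help the detective find the location. We want to train the robot to find the location using these environmental clues.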

Step 6: Define and visualize the new graph with environmental clues

# Defining the locations of the police and the drug traces
police = [2, 4, 5]
drug_traces = [3, 8, 9]

G = nx.Graph()
G.add_edges_from(edges)
mapping = {0: '0 - Detective', 1: '1', 2: '2 - Police', 3: '3 - Drug traces',
           4: '4 - Police', 5: '5 - Police', 6: '6', 7: '7', 8: '8 - Drug traces',
           9: '9 - Drug traces', 10: '10 - Drug racket location'}

H = nx.relabel_nodes(G, mapping)
pos = nx.spring_layout(H)
# A single node_size value applies to all 11 nodes
nx.draw_networkx_nodes(H, pos, node_size=200)
nx.draw_networkx_edges(H, pos)
nx.draw_networkx_labels(H, pos)
pl.show()

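Note: the plot produced by this code may look a little different from the earlier one, but it shows the same graph. The difference comes from the networkx library placing the nodes at random positions.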

Step 7: Define some utility functions for the training process

Q = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
env_police = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
env_drugs = np.matrix(np.zeros([MATRIX_SIZE, MATRIX_SIZE]))
initial_state = 1

# Same as above
def available_actions(state):
    current_state_row = M[state, ]
    av_action = np.where(current_state_row >= 0)[1]
    return av_action

# Same as above
def sample_next_action(available_actions_range):
    next_action = int(np.random.choice(available_actions_range))
    return next_action

# Exploring the environment
def collect_environmental_data(action):
    found = []
    if action in police:
        found.append('p')
    if action in drug_traces:
        found.append('d')
    return found

available_action = available_actions(initial_state)
action = sample_next_action(available_action)

# Same as above, but also records what was found at the chosen node
def update(current_state, action, gamma):
    max_index = np.where(Q[action, ] == np.max(Q[action, ]))[1]
    if max_index.shape[0] > 1:
        max_index = int(np.random.choice(max_index))
    else:
        max_index = int(max_index[0])
    max_value = Q[action, max_index]
    Q[current_state, action] = M[current_state, action] + gamma * max_value

    environment = collect_environmental_data(action)
    if 'p' in environment:
        env_police[current_state, action] += 1
    if 'd' in environment:
        env_drugs[current_state, action] += 1

    if np.max(Q) > 0:
        return np.sum(Q / np.max(Q) * 100)
    else:
        return 0

update(initial_state, action, gamma)

# Determines the available actions according to the environment.
# Relies on a global env_matrix_snap, which this snippet does not define;
# it must exist before the first call.
def available_actions_with_env_help(state):
    current_state_row = M[state, ]
    av_action = np.where(current_state_row >= 0)[1]

    # if there are multiple routes, dis-favor anything negative
    env_pos_row = env_matrix_snap[state, av_action]

    if np.sum(env_pos_row < 0):
        # can we remove the negative directions from av_action?
        temp_av_action = av_action[np.array(env_pos_row)[0] >= 0]
        if len(temp_av_action) > 0:
            av_action = temp_av_action
    return av_action
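The filtering rule inside `available_actions_with_env_help` is easier to follow on a toy input. The sketch below isolates that rule as a standalone function; the name `filter_actions` and the score values are made up for illustration. The actions `[0, 2, 3, 5, 7]` are the neighbors of node 1 in the graph above.

```python
import numpy as np

def filter_actions(av_action, env_scores):
    """Drop actions with a negative environment score, unless none would remain."""
    av_action = np.asarray(av_action)
    env_scores = np.asarray(env_scores)
    if np.any(env_scores < 0):
        kept = av_action[env_scores >= 0]
        if len(kept) > 0:
            return kept
    return av_action

# Nodes 2 and 5 carry negative scores, so they are removed
print(filter_actions([0, 2, 3, 5, 7], [0, -4, 6, -2, 0]))   # [0 3 7]

# Every option is negative, so the full list is kept as a fallback
print(filter_actions([2, 4], [-1, -3]))                     # [2 4]
```

The fallback matters: without it, a node whose neighbors all look unfavorable would leave the agent with no move at all.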

Step 8: Visualize the environment matrices

scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)

# print environmental matrices
print('Police Found')
print(env_police)
print('')
print('Drug traces Found')
print(env_drugs)


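Step 9: Train and evaluate the model

# Assumption: the code above never defines env_matrix_snap. Here it is taken to be
# the drug-trace counts minus the police counts, so that moves toward police
# are negative and are dis-favored by available_actions_with_env_help.
env_matrix_snap = env_drugs - env_police

scores = []
for i in range(1000):
    current_state = np.random.randint(0, int(Q.shape[0]))
    available_action = available_actions_with_env_help(current_state)
    action = sample_next_action(available_action)
    score = update(current_state, action, gamma)
    scores.append(score)

pl.plot(scores)
pl.xlabel('Number of iterations')
pl.ylabel('Reward gained')
pl.show()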