让AI玩一千万次贪吃蛇

如果让人工智能来玩贪吃蛇游戏，会发生什么？

图源：DALL·E

贪吃蛇实现

游戏规则

游戏实现

Q学习算法实现

Q学习简介

Q表和Q值

Q学习更新规则

Q学习在贪吃蛇游戏中的应用

整体项目完整代码

运行过程截图

代码分析

环境设置

蛇的行为定义

Q学习代理实现

小结

贪吃蛇实现

在深入探讨人工智能如何掌握贪吃蛇游戏之前，让我们先回顾一下贪吃蛇游戏的基本设计和规则。贪吃蛇是一款经典的电子游戏，其简单的规则和直观的游戏玩法使其成为了历史上最受欢迎的游戏之一。

游戏规则

在贪吃蛇游戏中，玩家控制一条不断移动的蛇，游戏目标是吃掉出现在屏幕上的食物，每吃掉一个食物，蛇的长度就会增加。游戏的挑战在于蛇不能触碰到屏幕边缘或自己的身体，否则游戏结束。随着蛇的不断增长，避免撞到自己变得越来越难，这就需要玩家有较高的策略和反应能力。

游戏实现

在我们的Python实现中，游戏界面使用Pygame库创建，设置了固定的宽度和高度，以及基本的颜色定义，如黑色背景、绿色蛇身和红色食物。游戏中的蛇和食物都以方块的形式表示，每个方块的大小由block_size变量定义。游戏的主体是SnakeGame类，它封装了游戏的所有基本逻辑，包括蛇的初始化、移动、食物的生成以及碰撞检测等。

width, height = 640, 480  # 游戏窗口的宽度和高度
block_size = 20  # 每个方块的大小，包括蛇的每一部分和食物
class SnakeGame:def __init__(self, width, height, block_size):self.width = widthself.height = heightself.block_size = block_sizeself.reset()

在reset方法中，游戏被重置到初始状态，蛇回到屏幕中心，长度为3个方块，方向向上。同时，食物被随机放置在屏幕上的某个位置。

def reset(self):initial_position = [self.width // 2, self.height // 2]self.snake = [initial_position,[initial_position[0] - self.block_size, initial_position[1]],[initial_position[0] - 2 * self.block_size, initial_position[1]]]self.direction = 'UP'self.food = [random.randrange(1, self.width // self.block_size) * self.block_size,random.randrange(1, self.height // self.block_size) * self.block_size]self.score = 0self.done = Falsereturn self.get_state()

蛇的移动是通过在蛇头前面添加一个新的方块，并在蛇尾去掉一个方块来实现的。如果蛇吃到了食物，就不去掉蛇尾的方块，从而使蛇的长度增加。

        在游戏的每一步，step方法都会被调用，它接收一个动作（上、下、左、右），然后更新游戏状态，包括蛇的位置、游戏得分和游戏是否结束。这个方法是连接游戏逻辑和AI代理的桥梁，AI代理将根据游戏的当前状态决定下一步的最佳动作。

完整的游戏类：

class SnakeGame:def __init__(self, width, height, block_size):self.width = widthself.height = heightself.block_size = block_sizeself.reset()def reset(self):initial_position = [self.width // 2, self.height // 2]self.snake = [initial_position,[initial_position[0] - self.block_size, initial_position[1]],[initial_position[0] - 2 * self.block_size, initial_position[1]]]self.direction = 'UP'self.food = [random.randrange(1, self.width // self.block_size) * self.block_size,random.randrange(1, self.height // self.block_size) * self.block_size]self.score = 0self.done = Falsereturn self.get_state()def step(self, action):directions = ['UP', 'DOWN', 'LEFT', 'RIGHT']if self.direction != 'UP' and action == 1:self.direction = 'DOWN'elif self.direction != 'DOWN' and action == 0:self.direction = 'UP'elif self.direction != 'LEFT' and action == 3:self.direction = 'RIGHT'elif self.direction != 'RIGHT' and action == 2:self.direction = 'LEFT'# 计算食物的距离以判断是否靠近食物distance_to_food_before = self.distance(self.snake[0], self.food)# 移动蛇x, y = self.snake[0]if self.direction == 'UP':y -= block_sizeelif self.direction == 'DOWN':y += block_sizeelif self.direction == 'LEFT':x -= block_sizeelif self.direction == 'RIGHT':x += block_sizenew_head = [x, y]if x < 0 or x >= width or y < 0 or y >= height:self.done = Truereward = -500  # 撞墙的惩罚return self.get_state(), reward, self.doneif new_head in self.snake:self.done = Truereward = -200  # 撞到自身的严重惩罚return self.get_state(), reward, self.doneself.snake.insert(0, new_head)if new_head == self.food:self.score += 1reward = 500  # 吃到食物的奖励self.food = [random.randrange(1, width // block_size) * block_size,random.randrange(1, height // block_size) * block_size]else:self.snake.pop()distance_to_food_after = self.distance(new_head, self.food)if distance_to_food_after < distance_to_food_before:reward = 30  # 靠近食物的奖励else:reward = -15  # 无进展的小惩罚# 检查是否长时间直线移动if len(set([part[0] for part in self.snake])) == 1 or len(set([part[1] for part in self.snake])) == 1:if len(self.snake) > 5:  # 如果蛇的长度超过一定阈值reward -= 5  # 长时间直线移动的小惩罚return self.get_state(), reward, self.donedef distance(self, point1, point2):return ((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2) ** 0.5def get_state(self):head = self.snake[0]point_l = [head[0] - self.block_size, head[1]]point_r = [head[0] + self.block_size, head[1]]point_u = [head[0], head[1] - self.block_size]point_d = [head[0], head[1] + self.block_size]state = [head[0] / self.width, head[1] / self.height,  # 蛇头的相对位置self.food[0] / self.width, self.food[1] / self.height,  # 食物的相对位置int(self.direction == 'LEFT'), int(self.direction == 'RIGHT'),int(self.direction == 'UP'), int(self.direction == 'DOWN'),  # 蛇头的方向int(self.is_collision(point_l)), int(self.is_collision(point_r)),int(self.is_collision(point_u)), int(self.is_collision(point_d))  # 周围是否会发生碰撞]return np.array(state, dtype=float)def is_collision(self, point):return point[0] < 0 or point[0] >= self.width or point[1] < 0 or point[1] >= self.height or point in self.snakedef display(self, screen):screen.fill(black)for part in self.snake:pygame.draw.rect(screen, green, pygame.Rect(part[0], part[1], self.block_size, self.block_size))pygame.draw.rect(screen, red, pygame.Rect(self.food[0], self.food[1], self.block_size, self.block_size))pygame.display.flip()

Q学习算法实现

Q学习是一种无模型的强化学习算法，广泛应用于各种决策过程，包括我们的贪吃蛇游戏。它使代理能够在与环境交互的过程中学习，从而确定在给定状态下采取哪种动作以最大化未来的奖励。接下来，我们将深入探讨Q学习的核心概念和工作原理，并结合前文提到的贪吃蛇游戏代码进行说明。

Q学习简介

Q学习的目标是学习一个策略，告诉代理在特定状态下应该采取什么动作以获得最大的长期奖励。这种策略通过一个叫做Q函数的东西来表示，它为每个状态-动作对分配一个值（Q值）。这个值代表了在给定状态下采取某个动作，并遵循当前策略的情况下，预期能获得的总奖励。

Q表和Q值

在实践中，Q函数通常通过一个表格（Q表）来实现，表中的每一行对应一个可能的状态，每一列对应一个可能的动作，表中的每个元素表示该状态-动作对的Q值。在我们的贪吃蛇游戏中，状态可以是蛇头的位置、食物的位置以及蛇头的方向等信息的组合，动作则是蛇的移动方向（上、下、左、右）。

Q学习更新规则

Q学习的核心是其更新规则，它定义了如何根据代理的经验逐步调整Q值。更新规则如下：

Q学习在贪吃蛇游戏中的应用

在贪吃蛇游戏的实现中，我们首先定义了一个QLearningAgent类，它包含了学习率、折扣率和探索率等参数，以及一个空的Q表来存储Q值。

class QLearningAgent:def __init__(self, learning_rate=0.5, discount_rate=0.8, exploration_rate=0.45):self.learning_rate = learning_rateself.discount_rate = discount_rateself.exploration_rate = exploration_rateself.q_table = {}

代理通过get_action方法根据当前状态选择动作，这里使用了ϵ-贪婪策略来平衡探索和利用：

def get_action(self, state):state_key = str(state)if state_key not in self.q_table:self.q_table[state_key] = np.zeros(4)if np.random.rand() < self.exploration_rate:return random.randint(0, 3)  # 探索else:return np.argmax(self.q_table[state_key])  # 利用

在每一步之后，update_q_table方法会根据Q学习的更新规则来调整Q值：

def update_q_table(self, state, action, reward, next_state, done):state_key = str(state)next_state_key = str(next_state)if next_state_key not in self.q_table:self.q_table[next_state_key] = np.zeros(4)next_max = np.max(self.q_table[next_state_key])self.q_table[state_key][action] = (1 - self.learning_rate) * self.q_table[state_key][action] + \self.learning_rate * (reward + self.discount_rate * next_max)

通过这种方式，Q学习代理能够在与环境的交互中逐渐学习到在各种状态下应该采取的最佳动作，从而在贪吃蛇游戏中获得尽可能高的分数。随着训练的进行，代理会变得越来越熟练，最终能够展示出高水平的游戏技巧。

整体项目完整代码

 import time
import pygame
import random
import numpy as npwidth, height = 640, 480black = (0, 0, 0)
green = (0, 255, 0)
red = (255, 0, 0)block_size = 20
speed = 15class SnakeGame:def __init__(self, width, height, block_size):self.width = widthself.height = heightself.block_size = block_sizeself.reset()def reset(self):initial_position = [self.width // 2, self.height // 2]self.snake = [initial_position,[initial_position[0] - self.block_size, initial_position[1]],[initial_position[0] - 2 * self.block_size, initial_position[1]]]self.direction = 'UP'self.food = [random.randrange(1, self.width // self.block_size) * self.block_size,random.randrange(1, self.height // self.block_size) * self.block_size]self.score = 0self.done = Falsereturn self.get_state()def step(self, action):directions = ['UP', 'DOWN', 'LEFT', 'RIGHT']if self.direction != 'UP' and action == 1:self.direction = 'DOWN'elif self.direction != 'DOWN' and action == 0:self.direction = 'UP'elif self.direction != 'LEFT' and action == 3:self.direction = 'RIGHT'elif self.direction != 'RIGHT' and action == 2:self.direction = 'LEFT'# 计算食物的距离以判断是否靠近食物distance_to_food_before = self.distance(self.snake[0], self.food)# 移动蛇x, y = self.snake[0]if self.direction == 'UP':y -= block_sizeelif self.direction == 'DOWN':y += block_sizeelif self.direction == 'LEFT':x -= block_sizeelif self.direction == 'RIGHT':x += block_sizenew_head = [x, y]if x < 0 or x >= width or y < 0 or y >= height:self.done = Truereward = -500  # 撞墙的惩罚return self.get_state(), reward, self.doneif new_head in self.snake:self.done = Truereward = -200  # 撞到自身的严重惩罚return self.get_state(), reward, self.doneself.snake.insert(0, new_head)if new_head == self.food:self.score += 1reward = 500  # 吃到食物的奖励self.food = [random.randrange(1, width // block_size) * block_size,random.randrange(1, height // block_size) * block_size]else:self.snake.pop()distance_to_food_after = self.distance(new_head, self.food)if distance_to_food_after < distance_to_food_before:reward = 30  # 靠近食物的奖励else:reward = -15  # 无进展的小惩罚# 检查是否长时间直线移动if len(set([part[0] for part in self.snake])) == 1 or len(set([part[1] for part in self.snake])) == 1:if len(self.snake) > 5:  # 如果蛇的长度超过一定阈值reward -= 5  # 长时间直线移动的小惩罚return self.get_state(), reward, self.donedef distance(self, point1, point2):return ((point1[0] - point2[0]) ** 2 + (point1[1] - point2[1]) ** 2) ** 0.5def get_state(self):head = self.snake[0]point_l = [head[0] - self.block_size, head[1]]point_r = [head[0] + self.block_size, head[1]]point_u = [head[0], head[1] - self.block_size]point_d = [head[0], head[1] + self.block_size]state = [head[0] / self.width, head[1] / self.height,  # 蛇头的相对位置self.food[0] / self.width, self.food[1] / self.height,  # 食物的相对位置int(self.direction == 'LEFT'), int(self.direction == 'RIGHT'),int(self.direction == 'UP'), int(self.direction == 'DOWN'),  # 蛇头的方向int(self.is_collision(point_l)), int(self.is_collision(point_r)),int(self.is_collision(point_u)), int(self.is_collision(point_d))  # 周围是否会发生碰撞]return np.array(state, dtype=float)def is_collision(self, point):return point[0] < 0 or point[0] >= self.width or point[1] < 0 or point[1] >= self.height or point in self.snakedef display(self, screen):screen.fill(black)for part in self.snake:pygame.draw.rect(screen, green, pygame.Rect(part[0], part[1], self.block_size, self.block_size))pygame.draw.rect(screen, red, pygame.Rect(self.food[0], self.food[1], self.block_size, self.block_size))pygame.display.flip()class QLearningAgent:def __init__(self, learning_rate=0.5, discount_rate=0.8, exploration_rate=0.45):self.learning_rate = learning_rateself.discount_rate = discount_rateself.exploration_rate = exploration_rateself.q_table = {}def get_action(self, state):state_key = str(state)if state_key not in self.q_table:self.q_table[state_key] = np.zeros(4)if np.random.rand() < self.exploration_rate:return random.randint(0, 3)else:return np.argmax(self.q_table[state_key])def update_q_table(self, state, action, reward, next_state, done):state_key = str(state)next_state_key = str(next_state)if next_state_key not in self.q_table:self.q_table[next_state_key] = np.zeros(4)next_max = np.max(self.q_table[next_state_key])self.q_table[state_key][action] = (1 - self.learning_rate) * self.q_table[state_key][action] + \self.learning_rate * (reward + self.discount_rate * next_max)if done:self.exploration_rate *= 0.9999if self.exploration_rate < 0.1:self.exploration_rate = 0.3game = SnakeGame(width, height, block_size)
agent = QLearningAgent()
num_episodes = 10000000
visualization_interval = 10000
for episode in range(num_episodes):if episode % visualization_interval == 1:pygame.init()screen = pygame.display.set_mode((width, height))visualize = Trueelse:visualize = Falsestate = game.reset()done = Falsetotal_reward = 0while not done:if visualize:game.display(screen)clock = pygame.time.Clock()clock.tick(12)action = agent.get_action(state)next_state, reward, done = game.step(action)agent.update_q_table(state, action, reward, next_state, done)state = next_statetotal_reward += rewardif visualize:pygame.quit()print(f"轮次: {episode}, 奖励分数: {total_reward}, 探索率: {agent.exploration_rate}, {agent}")

运行过程截图

代码分析

环境设置

游戏环境是通过Pygame库来创建和管理的。首先，我们初始化Pygame并设置游戏窗口的尺寸。接着，定义了几种基本颜色用于绘制游戏元素，如蛇身和食物。

import pygame
import randomwidth, height = 640, 480  # 游戏窗口的宽度和高度
black = (0, 0, 0)  # 背景颜色
green = (0, 255, 0)  # 蛇的颜色
red = (255, 0, 0)  # 食物的颜色
block_size = 20  # 每个方块的大小

蛇的行为定义

SnakeGame类封装了蛇的行为和游戏逻辑。蛇的初始状态是位于屏幕中心，长度为3个方块，方向向上。蛇的每一部分都是一个方块，我们通过一个列表来维护这些方块的坐标。

Q学习代理实现

QLearningAgent类实现了Q学习算法。它使用一个Q表来存储和更新状态-动作对的Q值，并根据ϵ-贪婪策略来选择动作。

Q学习代理实现

QLearningAgent类实现了Q学习算法。它使用一个Q表来存储和更新状态-动作对的Q值，并根据ϵ-贪婪策略来选择动作。

game = SnakeGame(width, height, block_size)
agent = QLearningAgent(learning_rate=0.5, discount_rate=0.8, exploration_rate=0.45)for episode in range(num_episodes):state = game.reset()done = Falsewhile not done:action = agent.get_action(state)next_state, reward, done = game.step(action)agent.update_q_table(state, action, reward, next_state, done)state = next_state

通过这个过程，Q学习代理逐渐学会了如何在贪吃蛇游戏中做出更好的决策，从而获得更高的分数。随着训练的进行，我们可以观察到代理的性能逐渐提高，这表明它正在学习和适应游戏环境。