Zheng Anchen: [Keras Machine Learning Practice Essentials] (13): Multi-GPU Distributed Training with TensorFlow

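Introduction

Setup

Single-host, multi-device synchronous training

How it works

How to use it

Using callbacks to ensure fault tolerance

tf.data performance tips

Notes on dataset batching

Calling dataset.cache()

Calling dataset.prefetch(buffer_size)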

Zheng Anchen's homepage: Zheng Anchen

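Part of the column: TensorFlow and Keras Machine Learning in Practice

I hope this blog is useful to you. If you spot any shortcomings, please point them out in the comments!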

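This article is a guide to multi-GPU training of Keras models with TensorFlow.

Introduction

There are generally two ways to distribute computation across multiple devices:

Data parallelism, where a single model is replicated on multiple devices or multiple machines. Each replica processes a different batch of data, and the results are then merged. There are many variants of this setup, which differ in how the model replicas merge their results, whether they stay in sync at every batch or are more loosely coupled, and so on.

Model parallelism, where different parts of a single model run on different devices and process a single batch of data together. This works best for models with a naturally parallel architecture, such as models with multiple branches.

This guide focuses on data parallelism, in particular synchronous data parallelism, where the replicas of the model stay in sync after each batch they process. Synchronicity keeps the model's convergence behavior identical to what you would see with single-device training.

Specifically, this article shows you how to use the tf.distribute API to train Keras models on multiple GPUs (typically 2 to 16) installed on a single machine, with minimal changes to your code (single-host, multi-device training). This is the most common setup for researchers and small-scale industry workflows.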

Setup

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import keras

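Single-host, multi-device synchronous training

In this setup, you have one machine with several GPUs on it (typically 2 to 16). Each device runs a copy of your model (called a replica). For simplicity, what follows assumes 8 GPUs, without loss of generality.

How it works

At each step of training:

The current batch of data (called the global batch) is split into 8 different sub-batches (called local batches). For instance, if the global batch has 512 samples, each of the 8 local batches has 64 samples.
Each of the 8 replicas independently processes one local batch: it runs a forward pass, then a backward pass, outputting the gradient of the weights with respect to the model's loss on the local batch.
The weight updates originating from the local gradients are efficiently merged across the 8 replicas. Because this happens at the end of every step, the replicas always stay in sync.

In practice, the process of synchronously updating the weights of the model replicas is handled at the level of each individual weight variable. This is done through a mirrored variable object.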
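The per-step procedure above can be illustrated with a small pure-Python sketch. It is not how tf.distribute is implemented. It assumes a toy one-dimensional linear model with a mean-squared-error loss, and it assumes that the merge step is a plain average of the local gradients. The helper names mse_gradient and split_global_batch are made up for this illustration.

```python
import random


def mse_gradient(w, batch):
    """Gradient of the mean squared error for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def split_global_batch(global_batch, num_replicas):
    """Split a global batch into equally sized local batches, one per replica."""
    local_size = len(global_batch) // num_replicas
    return [
        global_batch[i * local_size:(i + 1) * local_size]
        for i in range(num_replicas)
    ]


random.seed(0)
num_replicas = 8

# A global batch of 512 (x, y) samples drawn from y = 3x.
global_batch = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(512))]
local_batches = split_global_batch(global_batch, num_replicas)

w = 0.0  # every replica starts the step with the same weight

# Each replica computes a gradient on its own local batch.
local_grads = [mse_gradient(w, batch) for batch in local_batches]

# Assumed merge rule: average the local gradients across replicas.
merged_grad = sum(local_grads) / num_replicas

# Reference: the gradient a single device would compute on the whole global batch.
full_grad = mse_gradient(w, global_batch)

print(len(local_batches), len(local_batches[0]))
print(abs(merged_grad - full_grad) < 1e-9)
```

With equally sized local batches, the averaged gradient matches the single-device gradient on the global batch up to floating-point rounding, which is why synchronous data parallelism preserves the convergence behavior of single-device training.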

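How to use it

To do single-host, multi-device synchronous training with a Keras model, you use the tf.distribute.MirroredStrategy API. Here's how it works:

Instantiate a MirroredStrategy, optionally configuring which specific devices you want to use (by default the strategy uses all available GPUs).
Use the strategy object to open a scope, and within this scope, create all the Keras objects you need that contain variables. Typically, that means creating and compiling the model inside the distribution scope. In some cases, the first call to fit() may also create variables, so it's a good idea to put your fit() call in the scope as well.
Train the model via fit() as usual.

Importantly, we recommend that you use tf.data.Dataset objects to load data in multi-device or distributed workflows.

Schematically, it looks like this: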

# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = Model(...)
    model.compile(...)

    # Train the model on all available devices.
    model.fit(train_dataset, validation_data=val_dataset, ...)

    # Test the model on all available devices.
    model.evaluate(test_dataset)

Here is a simple end-to-end runnable example:

def get_compiled_model():
    # Make a simple 2-layer densely-connected neural network.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(10)(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return model


def get_dataset():
    batch_size = 32
    num_val_samples = 10000

    # Return the MNIST dataset in the form of a `tf.data.Dataset`
    # (https://www.tensorflow.org/api_docs/python/tf/data/Dataset).
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # Preprocess the data (these are Numpy arrays)
    x_train = x_train.reshape(-1, 784).astype("float32") / 255
    x_test = x_test.reshape(-1, 784).astype("float32") / 255
    y_train = y_train.astype("float32")
    y_test = y_test.astype("float32")

    # Reserve num_val_samples samples for validation
    x_val = x_train[-num_val_samples:]
    y_val = y_train[-num_val_samples:]
    x_train = x_train[:-num_val_samples]
    y_train = y_train[:-num_val_samples]
    return (
        tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size),
    )


# Create a MirroredStrategy.
strategy = tf.distribute.MirroredStrategy()
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = get_compiled_model()

    # Train the model on all available devices.
    train_dataset, val_dataset, test_dataset = get_dataset()
    model.fit(train_dataset, epochs=2, validation_data=val_dataset)

    # Test the model on all available devices.
    model.evaluate(test_dataset)

The result looks like this:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Number of devices: 1
Epoch 1/2
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - loss: 0.3830 - sparse_categorical_accuracy: 0.8884 - val_loss: 0.1361 - val_sparse_categorical_accuracy: 0.9574
Epoch 2/2
1563/1563 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - loss: 0.1068 - sparse_categorical_accuracy: 0.9671 - val_loss: 0.0894 - val_sparse_categorical_accuracy: 0.9724
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - loss: 0.0988 - sparse_categorical_accuracy: 0.9673
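The step counts in this log (1563 per training epoch and 313 for evaluation) follow directly from the dataset sizes and the batch size of 32 used in get_dataset(). The short check below is a sketch of that arithmetic, not part of the original example. It relies on MNIST having 60,000 training and 10,000 test images, and the helper name steps_per_epoch is made up for illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover num_samples; the last batch may be smaller."""
    return math.ceil(num_samples / batch_size)


batch_size = 32
num_train = 60000 - 10000  # MNIST training images minus the 10,000 held out for validation
num_test = 10000           # MNIST test images

train_steps = steps_per_epoch(num_train, batch_size)
test_steps = steps_per_epoch(num_test, batch_size)

print(train_steps, test_steps)  # prints "1563 313"
```

Because the dataset is batched once with batch_size as the global batch size, these step counts do not depend on how many replicas share each batch.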

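Using callbacks to ensure fault tolerance

When using distributed training, you should always make sure you have a strategy to recover from failure (fault tolerance). The simplest way to handle this is to pass a ModelCheckpoint callback to fit(), to save your model at regular intervals (e.g. every 100 batches or every epoch). You can then restart training from your saved model.

Here's a simple example: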

# Prepare a directory to store all the checkpoints.
checkpoint_dir = "./ckpt"
if not os.path.exists(checkpoint_dir):
    os.makedirs(checkpoint_dir)


def make_or_restore_model():
    # Either restore the latest model, or create a fresh one
    # if there is no checkpoint available.
    checkpoints = [checkpoint_dir + "/" + name for name in os.listdir(checkpoint_dir)]
    if checkpoints:
        latest_checkpoint = max(checkpoints, key=os.path.getctime)
        print("Restoring from", latest_checkpoint)
        return keras.models.load_model(latest_checkpoint)
    print("Creating a new model")
    return get_compiled_model()


def run_training(epochs=1):
    # Create a MirroredStrategy.
    strategy = tf.distribute.MirroredStrategy()

    # Open a strategy scope and create/restore the model
    with strategy.scope():
        model = make_or_restore_model()

        callbacks = [
            # This callback saves the model as a .keras file every epoch.
            # We include the current epoch in the file name.
            keras.callbacks.ModelCheckpoint(
                filepath=checkpoint_dir + "/ckpt-{epoch}.keras",
                save_freq="epoch",
            )
        ]
        model.fit(
            train_dataset,
            epochs=epochs,
            callbacks=callbacks,
            validation_data=val_dataset,
            verbose=2,
        )


# Running the first time creates the model
run_training(epochs=1)

# Calling the same function again will resume from where we left off
run_training(epochs=1)

The output of running it is as follows:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Creating a new model
1563/1563 - 7s - 4ms/step - loss: 0.2275 - sparse_categorical_accuracy: 0.9320 - val_loss: 0.1373 - val_sparse_categorical_accuracy: 0.9571
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
Restoring from ./ckpt/ckpt-1.keras
1563/1563 - 6s - 4ms/step - loss: 0.0944 - sparse_categorical_accuracy: 0.9717 - val_loss: 0.0972 - val_sparse_categorical_accuracy: 0.9710
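make_or_restore_model() picks the newest checkpoint by file creation time. As an alternative sketch, the latest checkpoint can also be chosen by the epoch number embedded in the ckpt-{epoch}.keras file name, which does not depend on filesystem timestamps. The names CKPT_PATTERN and latest_checkpoint are hypothetical helpers, not part of Keras, and the demo uses empty placeholder files rather than real saved models.

```python
import os
import re
import tempfile

# Matches file names produced by the filepath template "ckpt-{epoch}.keras".
CKPT_PATTERN = re.compile(r"^ckpt-(\d+)\.keras$")


def latest_checkpoint(checkpoint_dir):
    """Return the path of the checkpoint with the highest epoch number, or None."""
    best_epoch, best_path = -1, None
    for name in os.listdir(checkpoint_dir):
        match = CKPT_PATTERN.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(checkpoint_dir, name)
    return best_path


# Demo with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for epoch in (1, 2, 10):
        open(os.path.join(tmp, f"ckpt-{epoch}.keras"), "w").close()
    open(os.path.join(tmp, "notes.txt"), "w").close()  # ignored: wrong name pattern
    found = os.path.basename(latest_checkpoint(tmp))

print(found)  # prints "ckpt-10.keras"
```

Comparing the epoch as an integer keeps ckpt-10 ahead of ckpt-2, which a plain string sort would get wrong. Note that in run_training() above every fit() call restarts epoch numbering at 1, so this approach is only meaningful within a single multi-epoch run.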