This post walks through a simple implementation of C3D, a classic network for video action recognition, and can serve as an introduction to the field. The paper is "Learning Spatiotemporal Features with 3D Convolutional Networks" (ICCV 2015).
Framework: TensorFlow 1.6 + Python 2.7 + slim
Dataset: UCF101 (Center for Research in Computer Vision, University of Central Florida)
Code: 2012013382/C3D-Tensorflow-slim
The basics of 3D convolution are well covered elsewhere, so they are not repeated here. This section focuses on how the input frames (images) change shape as they pass through the network.
The basic C3D architecture is shown in Figure 1:
Figure 1: C3D network architecture
Details:
1) The input clip (video segment) has shape [batch_size, frame_length, crop_size, crop_size, channel_num], where frame_length is 16 (each sample is 16 consecutive frames), crop_size is 112, and channel_num is 3, i.e. each frame is normalized to size [112, 112, 3].
2) Every convolution kernel has size [3, 3, 3]: the first dimension is temporal, and the last two are the spatial kernel size on each frame (image). All convolutions use stride [1, 1, 1] and padding='SAME'.
3) All pooling layers are 3D max pooling. Only the first pooling layer has size and stride [1, 2, 2]; all the others use [2, 2, 2], with the dimensions meaning the same as in 1), and padding='SAME'. The authors use 1 in the temporal dimension of the first pooling layer so that the temporal dimension does not shrink to 1 too early.
The shape of an input clip changes through the network as follows.
Assume batch_size is 10.
Input shape:[10, 16, 112, 112, 3]
After conv1:[10, 16, 112, 112, 64]
After pool1:[10, 16, 56, 56, 64]
After conv2a:[10, 16, 56, 56, 128]
After pool2:[10, 8, 28, 28, 128]
After conv3a:[10, 8, 28, 28, 256]
After conv3b:[10, 8, 28, 28, 256]
After pool3:[10, 4, 14, 14, 256]
After conv4a:[10, 4, 14, 14, 512]
After conv4b:[10, 4, 14, 14, 512]
After pool4:[10, 2, 7, 7, 512]
After conv5a:[10, 2, 7, 7, 512]
After conv5b:[10, 2, 7, 7, 512]
After pool5:[10, 1, 4, 4, 512]
After fc6:[10, 4096]
After fc7:[10, 4096]
out:[10, num_classes] (num_classes is 101 for UCF101)
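Since every pooling layer uses 'SAME' padding, each output dimension is simply ceil(input / stride), so the trace above can be verified with a few lines of Python (a standalone sanity check of mine, not part of the project code):

from __future__ import print_function
import math

# With 'SAME' padding, pooled size = ceil(input / stride) per dimension;
# the stride-1 'SAME' convolutions do not change the temporal/spatial shape.
def pool_shape(shape, stride):
    return [int(math.ceil(float(n) / s)) for n, s in zip(shape, stride)]

shape = [16, 112, 112]  # [frame_length, height, width]; channels are unaffected
for name, stride in [('pool1', [1, 2, 2]), ('pool2', [2, 2, 2]),
                     ('pool3', [2, 2, 2]), ('pool4', [2, 2, 2]),
                     ('pool5', [2, 2, 2])]:
    shape = pool_shape(shape, stride)
    print(name, shape)
# pool1 [16, 56, 56], pool2 [8, 28, 28], pool3 [4, 14, 14],
# pool4 [2, 7, 7], pool5 [1, 4, 4]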
Data preprocessing
For video work, data preprocessing is relatively involved. Because video datasets are usually large, we first convert the videos to images, then read one batch of data from disk at a time. After downloading the UCF101 dataset, extract it into the project root directory. Create a file convert_video_to_images.sh with the following content:
for folder in $1/*
do
    for file in "$folder"/*.avi
    do
        if [[ ! -d "${file%.avi}" ]]; then
            mkdir -p "${file%.avi}"
        fi
        ffmpeg -i "$file" -vf fps=$2 "${file%.avi}"/%05d.jpg
        rm "$file"
    done
done
Then run
sudo ./convert_video_to_images.sh UCF101/ 5
which samples 5 frames per second from every video. Note that the script deletes each original .avi file after converting it.
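Each video is thereby replaced with a folder of numbered JPEGs. Using UCF101's actual file naming as an example, UCF101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01.avi becomes the folder UCF101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01/ holding 00001.jpg, 00002.jpg, and so on; these folders are what the list files below point to.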
Next, generate the training and test splits. Create convert_images_to_list.sh with the following content:
> train.list
> test.list
COUNT=-1
for folder in $1/*
do
    COUNT=$[$COUNT + 1]
    for imagesFolder in "$folder"/*
    do
        if (( $(jot -r 1 1 $2) > 1 )); then
            echo "$imagesFolder" $COUNT >> train.list
        else
            echo "$imagesFolder" $COUNT >> test.list
        fi
    done
done
Then run
./convert_images_to_list.sh UCF101/ 4
which assigns each image folder to the test set with probability 1/4 and to the training set otherwise.
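Each line of the generated lists is an image-folder path followed by a zero-based class index, which is exactly the format get_batches parses below, e.g. UCF101/ApplyEyeMakeup/v_ApplyEyeMakeup_g01_c01 0. One caveat: jot is a BSD utility that is usually absent on Linux; there, $(shuf -i 1-$2 -n 1) should give an equivalent random draw.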
For both training and testing, one batch_size worth of data is read from disk at a time, as follows.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import random
import time

import PIL.Image as Image
import cv2
import numpy as np

CLIP_LENGTH = 16
VALIDATION_PRO = 0.2
# Per-frame pixel mean, shape [16, 112, 112, 3], subtracted from every clip
np_mean = np.load('crop_mean.npy').reshape([CLIP_LENGTH, 112, 112, 3])


def get_test_num(filename):
    lines = open(filename, 'r')
    return len(list(lines))


def get_video_indices(filename):
    lines = list(open(filename, 'r'))
    # Shuffle the data and hold out VALIDATION_PRO of it for validation
    video_indices = list(range(len(lines)))  # list() so shuffle also works under Python 3
    random.seed(time.time())
    random.shuffle(video_indices)
    validation_video_indices = video_indices[:int(len(video_indices) * VALIDATION_PRO)]
    train_video_indices = video_indices[int(len(video_indices) * VALIDATION_PRO):]
    return train_video_indices, validation_video_indices


def frame_process(clip, clip_length=CLIP_LENGTH, crop_size=112, channel_num=3):
    frames_num = len(clip)
    croped_frames = np.zeros([frames_num, crop_size, crop_size, channel_num]).astype(np.float32)
    # Resize so the short side equals crop_size, then center-crop every frame
    # to [crop_size, crop_size, channel_num] and subtract the per-frame mean
    for i in range(frames_num):
        img = Image.fromarray(clip[i].astype(np.uint8))
        if img.width > img.height:
            scale = float(crop_size) / float(img.height)
            img = np.array(cv2.resize(np.array(img), (int(img.width * scale + 1), crop_size))).astype(np.float32)
        else:
            scale = float(crop_size) / float(img.width)
            img = np.array(cv2.resize(np.array(img), (crop_size, int(img.height * scale + 1)))).astype(np.float32)
        crop_x = int((img.shape[0] - crop_size) / 2)
        crop_y = int((img.shape[1] - crop_size) / 2)
        img = img[crop_x: crop_x + crop_size, crop_y: crop_y + crop_size, :]
        croped_frames[i, :, :, :] = img - np_mean[i]
    return croped_frames


def convert_images_to_clip(filename, clip_length=CLIP_LENGTH, crop_size=112, channel_num=3):
    # Build one clip of clip_length frames from a folder of images
    clip = []
    for parent, dirnames, filenames in os.walk(filename):
        filenames = sorted(filenames)
        if len(filenames) < clip_length:
            # Too few frames: take them all, then pad by repeating the last one
            for i in range(0, len(filenames)):
                image_name = str(filename) + '/' + str(filenames[i])
                img = Image.open(image_name)
                img_data = np.array(img)
                clip.append(img_data)
            for i in range(clip_length - len(filenames)):
                image_name = str(filename) + '/' + str(filenames[len(filenames) - 1])
                img = Image.open(image_name)
                img_data = np.array(img)
                clip.append(img_data)
        else:
            # Enough frames: sample a random contiguous run of clip_length frames
            s_index = random.randint(0, len(filenames) - clip_length)
            for i in range(s_index, s_index + clip_length):
                image_name = str(filename) + '/' + str(filenames[i])
                img = Image.open(image_name)
                img_data = np.array(img)
                clip.append(img_data)
    if len(clip) == 0:
        print(filename)
    clip = frame_process(clip, clip_length, crop_size, channel_num)
    return clip  # shape [clip_length, crop_size, crop_size, channel_num]


def get_batches(filename, num_classes, batch_index, video_indices, batch_size=10, crop_size=112, channel_num=3):
    lines = list(open(filename, 'r'))
    clips = []
    labels = []
    for i in video_indices[batch_index: batch_index + batch_size]:
        line = lines[i].strip('\n').split()
        dirname = line[0]
        label = line[1]
        i_clip = convert_images_to_clip(dirname, CLIP_LENGTH, crop_size, channel_num)
        clips.append(i_clip)
        labels.append(int(label))
    # Convert to numpy and one-hot encode the labels
    clips = np.array(clips).astype(np.float32)
    labels = np.array(labels).astype(np.int64)
    oh_labels = np.zeros([len(labels), num_classes]).astype(np.int64)
    for i in range(len(labels)):
        oh_labels[i, labels[i]] = 1
    batch_index = batch_index + batch_size
    batch_data = {'clips': clips, 'labels': oh_labels}
    return batch_data, batch_index
One thing to note: for simplicity, I randomly sample a single contiguous 16-frame clip from each video as one sample. With batch_size 10, one network input therefore consists of 10 clips taken from 10 videos, 16 random consecutive frames each.
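A minimal usage sketch of this module (my illustration, not part of the project files), assuming train.list from the previous step and crop_mean.npy (the per-frame pixel mean; the hx173149/C3D-tensorflow repo listed under References provides one) are in the working directory:

import data_processing

# Shuffle the list and split off 20% for validation
train_idx, val_idx = data_processing.get_video_indices('train.list')
# Fetch the first batch of 10 clips with one-hot labels
batch, next_index = data_processing.get_batches('train.list', 101, 0, train_idx, batch_size=10)
print(batch['clips'].shape)   # (10, 16, 112, 112, 3)
print(batch['labels'].shape)  # (10, 101)
# Pass next_index back in as batch_index to fetch the following batch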
The model uses slim, which keeps the implementation short and easy to read.
import tensorflow as tf
import tensorflow.contrib.slim as slim


def C3D(input, num_classes, keep_pro=0.5):
    with tf.variable_scope('C3D'):
        with slim.arg_scope([slim.conv3d],
                            padding='SAME',
                            weights_regularizer=slim.l2_regularizer(0.0005),
                            activation_fn=tf.nn.relu,
                            kernel_size=[3, 3, 3],
                            stride=[1, 1, 1]):
            net = slim.conv3d(input, 64, scope='conv1')
            # Only pool1 keeps the temporal dimension (kernel/stride 1 in time)
            net = slim.max_pool3d(net, kernel_size=[1, 2, 2], stride=[1, 2, 2], padding='SAME', scope='max_pool1')
            net = slim.conv3d(net, 128, scope='conv2')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool2')
            net = slim.repeat(net, 2, slim.conv3d, 256, scope='conv3')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool3')
            net = slim.repeat(net, 2, slim.conv3d, 512, scope='conv4')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool4')
            net = slim.repeat(net, 2, slim.conv3d, 512, scope='conv5')
            net = slim.max_pool3d(net, kernel_size=[2, 2, 2], stride=[2, 2, 2], padding='SAME', scope='max_pool5')
            # pool5 output is [batch, 1, 4, 4, 512]; flatten it for the fc layers
            net = tf.reshape(net, [-1, 512 * 4 * 4])
            net = slim.fully_connected(net, 4096, weights_regularizer=slim.l2_regularizer(0.0005), scope='fc6')
            net = slim.dropout(net, keep_pro, scope='dropout1')
            net = slim.fully_connected(net, 4096, weights_regularizer=slim.l2_regularizer(0.0005), scope='fc7')
            net = slim.dropout(net, keep_pro, scope='dropout2')
            out = slim.fully_connected(net, num_classes,
                                       weights_regularizer=slim.l2_regularizer(0.0005),
                                       activation_fn=None, scope='out')
    return out
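A quick graph-construction check of the model (again a sketch of mine, assuming the code above is saved as C3D_model.py):

import tensorflow as tf
import C3D_model

# Build the graph once and confirm the logits shape; keep_pro=1.0 disables dropout
clips = tf.placeholder(tf.float32, [10, 16, 112, 112, 3])
logits = C3D_model.C3D(clips, 101, keep_pro=1.0)
print(logits.get_shape().as_list())  # [10, 101]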
Training
import tensorflow as tf
import numpy as np
import C3D_model
import time
import data_processing
import os
import os.path
from os.path import join
TRAIN_LOG_DIR = os.path.join('Log/train/', time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())))
TRAIN_CHECK_POINT = 'check_point/'
TRAIN_LIST_PATH = 'train.list'
TEST_LIST_PATH = 'test.list'
BATCH_SIZE = 10
NUM_CLASSES = 101
CROP_SIZE = 112
CHANNEL_NUM = 3
CLIP_LENGTH = 16
EPOCH_NUM = 50
INITIAL_LEARNING_RATE = 1e-4
LR_DECAY_FACTOR = 0.5
EPOCHS_PER_LR_DECAY = 2
MOVING_AV_DECAY = 0.9999
#Get shuffled indices
train_video_indices, validation_video_indices = data_processing.get_video_indices(TRAIN_LIST_PATH)

with tf.Graph().as_default():
    batch_clips = tf.placeholder(tf.float32, [BATCH_SIZE, CLIP_LENGTH, CROP_SIZE, CROP_SIZE, CHANNEL_NUM], name='X')
    batch_labels = tf.placeholder(tf.int32, [BATCH_SIZE, NUM_CLASSES], name='Y')
    keep_prob = tf.placeholder(tf.float32)
    logits = C3D_model.C3D(batch_clips, NUM_CLASSES, keep_prob)
    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=batch_labels))
        tf.summary.scalar('entropy_loss', loss)
    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(batch_labels, 1)), np.float32))
        tf.summary.scalar('accuracy', accuracy)
    # Exponential learning-rate decay is left disabled in favor of a fixed rate:
    # global_step = tf.Variable(0, name='global_step', trainable=False)
    # decay_step = EPOCHS_PER_LR_DECAY * len(train_video_indices) // BATCH_SIZE
    # learning_rate = tf.train.exponential_decay(INITIAL_LEARNING_RATE, global_step,
    #                                            decay_step, LR_DECAY_FACTOR, staircase=True)
    learning_rate = 1e-4
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)  # , global_step=global_step
    saver = tf.train.Saver()
    summary_op = tf.summary.merge_all()
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        train_summary_writer = tf.summary.FileWriter(TRAIN_LOG_DIR, sess.graph)
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        step = 0
        for epoch in range(EPOCH_NUM):
            accuracy_epoch = 0
            loss_epoch = 0
            batch_index = 0
            for i in range(len(train_video_indices) // BATCH_SIZE):
                step += 1
                batch_data, batch_index = data_processing.get_batches(TRAIN_LIST_PATH, NUM_CLASSES, batch_index,
                                                                      train_video_indices, BATCH_SIZE)
                _, loss_out, accuracy_out, summary = sess.run([optimizer, loss, accuracy, summary_op],
                                                              feed_dict={batch_clips: batch_data['clips'],
                                                                         batch_labels: batch_data['labels'],
                                                                         keep_prob: 0.5})
                loss_epoch += loss_out
                accuracy_epoch += accuracy_out
                if i % 10 == 0:
                    print('Epoch %d, Batch %d: Loss is %.5f; Accuracy is %.5f'
                          % (epoch + 1, i, loss_out, accuracy_out))
                    train_summary_writer.add_summary(summary, step)
            print('Epoch %d: Average loss is: %.5f; Average accuracy is: %.5f'
                  % (epoch + 1, loss_epoch / (len(train_video_indices) // BATCH_SIZE),
                     accuracy_epoch / (len(train_video_indices) // BATCH_SIZE)))
            # Run the validation set (dropout disabled) after every epoch
            accuracy_epoch = 0
            loss_epoch = 0
            batch_index = 0
            for i in range(len(validation_video_indices) // BATCH_SIZE):
                batch_data, batch_index = data_processing.get_batches(TRAIN_LIST_PATH, NUM_CLASSES, batch_index,
                                                                      validation_video_indices, BATCH_SIZE)
                loss_out, accuracy_out = sess.run([loss, accuracy],
                                                  feed_dict={batch_clips: batch_data['clips'],
                                                             batch_labels: batch_data['labels'],
                                                             keep_prob: 1.0})
                loss_epoch += loss_out
                accuracy_epoch += accuracy_out
            print('Validation loss is %.5f; Accuracy is %.5f'
                  % (loss_epoch / (len(validation_video_indices) // BATCH_SIZE),
                     accuracy_epoch / (len(validation_video_indices) // BATCH_SIZE)))
            saver.save(sess, TRAIN_CHECK_POINT + 'train.ckpt', global_step=epoch)
Here 20% of the training set is held out as a validation set; after each epoch on the training set, the model is evaluated once on the validation set.
Testing
import tensorflow as tf
import numpy as np
import C3D_model
import data_processing
TRAIN_LOG_DIR = 'Log/train/'
TRAIN_CHECK_POINT = 'check_point/train.ckpt-36'
TEST_LIST_PATH = 'test.list'
BATCH_SIZE = 10
NUM_CLASSES = 101
CROP_SIZE = 112
CHANNEL_NUM = 3
CLIP_LENGTH = 16
EPOCH_NUM = 50
test_num = data_processing.get_test_num(TEST_LIST_PATH)
test_video_indices = range(test_num)

with tf.Graph().as_default():
    batch_clips = tf.placeholder(tf.float32, [BATCH_SIZE, CLIP_LENGTH, CROP_SIZE, CROP_SIZE, CHANNEL_NUM], name='X')
    batch_labels = tf.placeholder(tf.int32, [BATCH_SIZE, NUM_CLASSES], name='Y')
    keep_prob = tf.placeholder(tf.float32)
    logits = C3D_model.C3D(batch_clips, NUM_CLASSES, keep_prob)
    accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), tf.argmax(batch_labels, 1)), np.float32))
    restorer = tf.train.Saver()
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())
        # Restore the trained weights from the checkpoint
        restorer.restore(sess, TRAIN_CHECK_POINT)
        accuracy_epoch = 0
        batch_index = 0
        for i in range(test_num // BATCH_SIZE):
            if i % 10 == 0:
                print('Testing %d of %d' % (i + 1, test_num // BATCH_SIZE))
            batch_data, batch_index = data_processing.get_batches(TEST_LIST_PATH, NUM_CLASSES, batch_index,
                                                                  test_video_indices, BATCH_SIZE)
            accuracy_out = sess.run(accuracy,
                                    feed_dict={batch_clips: batch_data['clips'],
                                               batch_labels: batch_data['labels'],
                                               keep_prob: 1.0})
            accuracy_epoch += accuracy_out
        print('Test accuracy is %.5f' % (accuracy_epoch / (test_num // BATCH_SIZE)))
Results
I trained for 36 epochs; the final model reaches about 72% accuracy on the test set.
References
hx173149/C3D-tensorflow (github.com)