UFLDL教程：Exercise:Softmax Regression

Softmax分类函数的Python实现

Deep Learning and Unsupervised Feature Learning Tutorial Solutions

逻辑回归假设函数

在线性回归问题中，假设函数具有如下形式：

这里写图片描述

在 logistic 回归中，我们的训练集由m 个已标记的样本构成：这里写图片描述，其中输入特征。由于 logistic 回归是针对二分类问题的，因此类标记。逻辑回归的假设函数的输出值位于[0, 1]之间，所以，我们需要找到一个满足这个性质的假设函数。

在逻辑回归问题中，将该函数的形式转换为如下形式：

这里写图片描述

其中，函数g称为S型函数（sigmoid function）或者是逻辑函数（Logistic function）（这两个术语是可以互换的），它具有如下形式：

这里写图片描述

该函数图形如下图所示：

这里写图片描述

可以看到，S型函数的取值位于（0,1）之间，满足这里写图片描述。那么

逻辑回归的假设函数(hypothesis function) 可以为如下形式：

这里写图片描述

虽然该算法中有”回归”二字，但其实它并不是一种回归算法，而是一种分类算法。

成本函数Cost和代价函数J

介绍构造逻辑回归问题的成本函数Cost和代价函数J。给定了一个含有m个样本的训练样本集，每个样本具有n维特征。x(i)为第i个样本的特征向量，y(i)为第i个样本的分类标签，回归问题的假设函数为hθ(x)，那么，如何能够根据给定的训练样本集得到假设函数中的参数向量θ（也就是模型参数）呢？

参考线性回归分析中的代价函数J(θ)：

这里写图片描述

可以构造如下的代价函数J(θ)

这里写图片描述

其中成本函数Cost为（类比线性回归问题构造的）：

这里写图片描述

但是，这个成本函数Cost是关于θ的非凸函数，它的图形如下图左侧所示，梯度下降法不能保证该函数收敛到全局最小值，但梯度下降法可以保证凸函数（如下图右侧所示）收敛到全局最小解。所以，我们需要寻找另外的代价函数，使它是凸函数。

这里写图片描述

对于逻辑回归问题，构造如下式所示的成本函数：

这里写图片描述

也就是说，当样本标签为1时，利用-log(hθ(x))计算成本函数，该函数图像如下图左侧所示；当样本标签为0时，利用-log(1-hθ(x))计算成本函数，该函数图像如下图右侧所示。

这里写图片描述

从图中可以看出成本函数的意义：
当y=1（即标签为1）时
假设函数的输出值为1，则成本函数取值为0（实现了正确的分类，不需要付出代价）。
假设函数的输出值为0，则成本函数取值为无穷大（分类错误，付出了无穷大的代价）。

逻辑回归的极大使然推导

逻辑回归的理激活函数是sigmoid函数，可理解成一个被sigmoid函数归一化后的线性回归。因为sigmoid函数把实数映射到了[0,1]区间。给定有一个训练数据，构造它的似然函数（likelihood function）为

这里写图片描述

一般会使用最大释然求解参数，这时取一个负的log对数（negative logarithm），得到：

这里写图片描述

上式被称为交叉熵(cross entropy) loss函数，因为取了一个负对数，之前的最大化就变成了最小化，所以只需求解是交叉熵loss函数最小的参数。

对loss函数求导得到

这里写图片描述

到现在为止，我们已经得到了loss函数以及关于参数的偏导数，只需要通过梯度下降就可以得到参数的解。

简化成本函数Cost和代价函数J

下式给出了逻辑回归问题的代价函数J(θ)（即考虑了m个训练样本）

这里写图片描述

可以看到，在分类问题中，训练样本的y值永远是1或者0，所以可以据此来简化成本函数的书写（即将原来的分情况讨论的两个式子合并为一个式子），简化后的成本函数如下：

这里写图片描述

从而，逻辑回归的代价函数J(θ)可以写为：

这里写图片描述

通过最小化代价函数J(θ)可以得到最优的向量参数。

逻辑回归问题的梯度下降法

相应的损失函数为

这里写图片描述

优化问题的关键问题是代价函数对待优化参数的梯度和代价函数取值的求解。

利用下式计算代价函数取值

利用下式计算代价函数的偏导数

偏导数公式和向量化推导

偏导数的推导过程

这里写图片描述

向量化的推导过程

这里写图片描述

% Exercise 4 -- Logistic Regressionclear all; close all; clcx = load('ex4x.dat'); 
y = load('ex4y.dat');[m, n] = size(x);% Add intercept term to x
x = [ones(m, 1), x]; % Plot the training data
% Use different markers for positives and negatives
figure
pos = find(y); neg = find(y == 0);%find是找到的一个向量，其结果是find函数括号值为真时的值的编号
plot(x(pos, 2), x(pos,3), '+')
hold on
plot(x(neg, 2), x(neg, 3), 'o')
hold on
xlabel('Exam 1 score')
ylabel('Exam 2 score')% Initialize fitting parameters
theta = zeros(n+1, 1);% Define the sigmoid function
g = inline('1.0 ./ (1.0 + exp(-z))'); % Newton's method
MAX_ITR = 7;
J = zeros(MAX_ITR, 1);for i = 1:MAX_ITR% Calculate the hypothesis functionz = x * theta;h = g(z);%转换成logistic函数% Calculate gradient and hessian.% The formulas below are equivalent to the summation formulas% given in the lecture videos.grad = (1/m).*x' * (h-y);%梯度的矢量表示法H = (1/m).*x' * diag(h) * diag(1-h) * x;%hessian矩阵的矢量表示法% Calculate J (for testing convergence)J(i) =(1/m)*sum(-y.*log(h) - (1-y).*log(1-h));%损失函数的矢量表示法theta = theta - H\grad;%是这样子的吗？
end
% Display theta
theta% Calculate the probability that a student with
% Score 20 on exam 1 and score 80 on exam 2 
% will not be admitted
prob = 1 - g([1, 20, 80]*theta)%画出分界面
% Plot Newton's method result
% Only need 2 points to define a line, so choose two endpoints
plot_x = [min(x(:,2))-2,  max(x(:,2))+2];
% Calculate the decision boundary line，plot_y的计算公式见博客下面的评论。
plot_y = (-1./theta(3)).*(theta(2).*plot_x +theta(1));
plot(plot_x, plot_y)
legend('Admitted', 'Not admitted', 'Decision Boundary')
hold off% Plot J
figure
plot(0:MAX_ITR-1, J, 'o--', 'MarkerFaceColor', 'r', 'MarkerSize', 8)
xlabel('Iteration'); ylabel('J')
% Display J
J

Softmax的极大使然推导

在Logistic回归中，样本数据的值这里写图片描述，而在softmax回归中，其中k是类别种数，

比如在手写识别中k=10，表示要识别的10个数字。设

这里写图片描述

那么
这里写图片描述

而且有

这里写图片描述

为了将多项式模型表述成指数分布族，先引入T(y)，它是一个k-1维的向量，那么

这里写图片描述

应用于一般线性模型，y必然是属于k个类中的一种。用这里写图片描述表示为真，同样当为假时，有

，那么进一步得到联合分布的概率密度函数为

这里写图片描述

对比一下，可以得到

这里写图片描述

由于
这里写图片描述

那么最终得到

这里写图片描述

可以得到期望值为

这里写图片描述

接下来得到对数似然函数函数为

这里写图片描述

其中这里写图片描述是一个k(n+1)的矩阵，代表这k个类的所有训练参数，每个类的参数是一个n+1维的向量。所以在softmax回归中将x分类为类别 l 的概率为

这里写图片描述

跟Logistic回归一样，softmax也可以用梯度下降法或者牛顿迭代法求解，对对数似然函数求偏导数，得到

这里写图片描述

然后我们可以通过梯度上升法来更新参数

这里写图片描述

注意这里这里写图片描述是第 l 个类的所有参数，它是一个向量。

在softmax回归中直接用上述对数似然函数是不能更新参数的，因为它存在冗余的参数，通常用牛顿方法中的Hessian矩阵也不可逆，是一个非凸函数，那么可以通过添加一个权重衰减项来修改代价函数，使得代价函数是凸函数，并且得到的Hessian矩阵可逆。更多详情参考如下链接。
链接：http://deeplearning.stanford.edu/wiki/index.php/Softmax%E5%9B%9E%E5%BD%92

Softmax回归

在 softmax回归中，我们解决的是多分类问题（相对于 logistic 回归解决的二分类问题），类标 y 可以取 k 个不同的值（而不是 2 个）。因此，对于训练集这里写图片描述，我们有。（注意此处的类别下标从 1 开始，而不是 0）。

对于给定的测试输入 x，我们想用假设函数针对每一个类别j估算出概率值 p(y=j | x)。也就是说，我们想估计 x 的每一种分类结果出现的概率。因此，我们的假设函数将要输出一个 k 维的向量（向量元素的和为1）来表示这 k 个估计的概率值。具体地说，我们的假设函数 $\textstyle h_{\theta}(x)$ 形式如下：

这里写图片描述

其中 $\theta_1, \theta_2, \ldots, \theta_k \in \Re^{n+1}$ 是模型的参数。请注意这里写图片描述这一项对概率分布进行归一化，使得所有概率之和为 1

代价函数

这里写图片描述

这个公式是从逻辑回归的代价函数推广而来的。

这里写图片描述

注意在Softmax回归中将 x 分类为类别 j的概率为：

这里写图片描述

备注

逻辑回归的代价函数是根据极大似然估计推理得来，Softmax的代价函数也类似。

在逻辑回归中我们梯度下降法求解最优值，Softmax回归也是用梯度下降法求解最优值，梯度公式如下：

这里写图片描述

Softmax回归模型参数具有“冗余”性

冗余性指的是最优解不止一个，有多个。假设我们从参数向量这里写图片描述中减去了向量，这时，每一个都变成了。此时假设函数变成了以下的式子：

这里写图片描述

我们看到，从这里写图片描述中减去完全不影响假设函数的预测结果！这就是Softmax回归的冗余性。

权重衰减

针对上述的冗余性，我们应该怎么办呢？权重衰减可以解决这个问题。
我们通过添加一个权重衰减项这里写图片描述来修改代价函数，这个衰减项会惩罚过大的参数值，现在我们的代价函数变为：

这里写图片描述

有了这个权重衰减项以后 $(\textstyle \lambda > 0)$ ，代价函数就变成了严格的凸函数，这样就可以保证得到唯一的解了。

为了使用优化算法，我们需要求得这个新函数这里写图片描述的导数，如下：

这里写图片描述

通过最小化 $\textstyle J(\theta)$ ，我们就能实现一个可用的 softmax 回归模型。

Softmax回归与Logistic回归的关系

当类别数 k = 2 时，softmax 回归退化为 logistic 回归。这表明 softmax 回归是 logistic 回归的一般形式。具体地说，当 k = 2 时，softmax 回归的假设函数为

这里写图片描述

利用softmax回归参数冗余的特点，我们令 $\textstyle \psi = \theta_1$ ，并且从两个参数向量中都减去向量这里写图片描述，得到:

这里写图片描述

因此，用这里写图片描述来表示，我们就会发现 softmax 回归器预测其中一个类别的概率为，另一个类别概率的为，这与 logistic回归是一致的。

Softmax 回归 vs. k 个二元分类器

如果你在开发一个音乐分类的应用，需要对k种类型的音乐进行识别，那么是选择使用 softmax 分类器呢，还是使用 logistic 回归算法建立 k 个独立的二元分类器呢？

这一选择取决于你的类别之间是否互斥，例如，如果你有四个类别的音乐，分别为：古典音乐、乡村音乐、摇滚乐和爵士乐，那么你可以假设每个训练样本只会被打上一个标签（即：一首歌只能属于这四种音乐类型的其中一种），此时你应该使用类别数 k = 4 的softmax回归。（如果在你的数据集中，有的歌曲不属于以上四类的其中任何一类，那么你可以添加一个“其他类”，并将类别数 k 设为5。）

如果你的四个类别如下：人声音乐、舞曲、影视原声、流行歌曲，那么这些类别之间并不是互斥的。例如：一首歌曲可以来源于影视原声，同时也包含人声。这种情况下，使用4个二分类的 logistic 回归分类器更为合适。这样，对于每个新的音乐作品，我们的算法可以分别判断它是否属于各个类别。

现在我们来看一个计算视觉领域的例子，你的任务是将图像分到三个不同类别中。
(i) 假设这三个类别分别是：室内场景、户外城区场景、户外荒野场景。你会使用sofmax回归还是 3个logistic 回归分类器呢？
(ii) 现在假设这三个类别分别是室内场景、黑白图片、包含人物的图片，你又会选择 softmax 回归还是多个 logistic 回归分类器呢？

在第一个例子中，三个类别是互斥的，因此更适于选择softmax回归分类器 。
而在第二个例子中，建立三个独立的 logistic回归分类器更加合适。

实验步骤

1.初始化参数，加载训练数据集。注意：MNIST手写数字数据集所有图片的每个像素灰度值都已经被归一化到了[0，1]之间，所以将来如果是用自己的训练样本，不要忘记归一化像素值。
2.矢量化编程实现softmax回归的代价函数及其梯度，即softmaxCost.m文件。
3.利用computeNumericalGradient函数检查上一步中的梯度计算是否正确，该函数见Deep Learning一：Sparse Autoencoder练习（斯坦福大学UFLDL深度学习教程）。
4.用用L-BFGS算法训练softmax回归模型，得到模型softmaxModel，见softmaxTrain.m中的softmaxTrain函数
5.加载测试数据集，用上一步训练得到模型softmaxModel来对测试数据进行分类，得到分类结果（见softmaxPredict.m），然后计算正确率。

softmaxExercise.m

%% CS294A/CS294W Softmax Exercise %  Instructions
%  ------------
% 
%  This file contains code that helps you get started on the
%  softmax exercise. You will need to write the softmax cost function 
%  in softmaxCost.m and the softmax prediction function in softmaxPred.m. 
%  For this exercise, you will not need to change any code in this file,
%  or any other files other than those mentioned above.
%  (However, you may be required to do so in later exercises)%%======================================================================
%% STEP 0: Initialise constants and parameters
%
%  Here we define and initialise some constants which allow your code
%  to be used more generally on any arbitrary input. 
%  We also initialise some parameters used for tuning the model.inputSize = 28 * 28; % Size of input vector (MNIST images are 28x28)特征维数
numClasses = 10;     % Number of classes (MNIST images fall into 10 classes)样本类别数lambda = 1e-4; % Weight decay parameter衰减项权重%%======================================================================
%% STEP 1: Load data
%
%  In this section, we load the input and output data.
%  For softmax regression on MNIST pixels, 
%  the input data is the images, and 
%  the output data is the labels.
%% Change the filenames if you've saved the files under different names
% On some platforms, the files might be saved as 
% train-images.idx3-ubyte / train-labels.idx1-ubyte
addpath mnist/
images = loadMNISTImages('mnist/train-images.idx3-ubyte');%每一列为MNIST数据集中的一个样本的特征向量
labels = loadMNISTLabels('mnist/train-labels.idx1-ubyte');% 每个样本的类标
labels(labels==0) = 10; % Remap 0 to 10inputData = images;
clear images
% For debugging purposes, you may wish to reduce the size of the input data
% in order to speed up gradient checking. 
% Here, we create synthetic dataset using random data for testingDEBUG = true; % Set DEBUG to true when debugging.
if DEBUGinputSize = 8;inputData = randn(8, 100);labels = randi(10, 100, 1);
end%它的维数是k*n，k是类别数numClasses，n是输入样本的特征维数inputSize
% Randomly initialise theta
theta = 0.005 * randn(numClasses * inputSize, 1); %输入的是一个列向量%%======================================================================
%% STEP 2: Implement softmaxCost
%
%  Implement softmaxCost in softmaxCost.m. [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, inputData, labels);%%======================================================================
%% STEP 3: Gradient checking
%
%  As with any learning algorithm, you should always check that your
%  gradients are correct before learning the parameters.
% if DEBUGnumGrad = computeNumericalGradient( @(x) softmaxCost(x, numClasses, ...inputSize, lambda, inputData, labels), theta);% Use this to visually compare the gradients side by sidedisp([numGrad grad]); % Compare numerically computed gradients with those computed analyticallydiff = norm(numGrad-grad)/norm(numGrad+grad);disp(diff); % The difference should be small. % In our implementation, these values are usually less than 1e-7.% When your gradients are correct, congratulations!
end%%======================================================================
%% STEP 4: Learning parameters
%
%  Once you have verified that your gradients are correct, 
%  you can start training your softmax regression code using softmaxTrain
%  (which uses minFunc).options.maxIter = 100;
%softmaxModel其实只是一个结构体，里面包含了学习到的最优参数以及输入尺寸大小和类别个数信息
softmaxModel = softmaxTrain(inputSize, numClasses, lambda, ...inputData, labels, options);% Although we only use 100 iterations here to train a classifier for the 
% MNIST data set, in practice, training for more iterations is usually
% beneficial.%%======================================================================
%% STEP 5: Testing
%
%  You should now test your model against the test images.
%  To do this, you will first need to write softmaxPredict
%  (in softmaxPredict.m), which should return predictions
%  given a softmax model and the input data.images = loadMNISTImages('mnist/t10k-images.idx3-ubyte');
labels = loadMNISTLabels('mnist/t10k-labels.idx1-ubyte');
labels(labels==0) = 10; % Remap 0 to 10inputData = images;
clear images
size(softmaxModel.optTheta)
size(inputData)
% You will have to implement softmaxPredict in softmaxPredict.m
[pred] = softmaxPredict(softmaxModel, inputData);acc = mean(labels(:) == pred(:));
fprintf('Accuracy: %0.3f%%\n', acc * 100);% Accuracy is the proportion of correctly classified images
% After 100 iterations, the results for our implementation were:
%
% Accuracy: 92.200%
%
% If your values are too low (accuracy less than 0.91), you should check 
% your code for errors, and make sure you are training on the 
% entire data set of 60000 28x28 training images 
% (unless you modified the loading code, this should be the case)

softmaxCost. m

function [cost, grad] = softmaxCost(theta, numClasses, inputSize, lambda, data, labels)% numClasses - the number of classes 
% inputSize - the size N of the input vector
% lambda - weight decay parameter
% data - the N x M input matrix, where each column data(:, i) corresponds to
%        a single test set
% labels - an M x 1 matrix containing the labels corresponding for the input data
%% Unroll the parameters from theta
theta = reshape(theta, numClasses, inputSize);%将输入的参数列向量变成一个矩阵numCases = size(data, 2);%输入样本的个数groundTruth = full(sparse(labels, 1:numCases, 1));
%产生一个100*100的矩阵，它的第labels(i)行第i列的元素值为1，其余全为0，其中i为1到numCases，即：1到100
cost = 0;thetagrad = zeros(numClasses, inputSize);%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute the cost and gradient for softmax regression.
%                You need to compute thetagrad and cost.
%                The groundTruth matrix might come in handy.M = bsxfun(@minus,theta*data,max(theta*data, [], 1));
% max(theta*data, [], 1)返回theta*data每一列的最大值，返回值为行向量
% theta*data的每个元素值都减去其对应列的最大值，即：把每一列的最大值都置为0了
% 这一步的目的是防止下一步计算指数函数时溢出
M = exp(M);
p = bsxfun(@rdivide, M, sum(M));
cost = -1/numCases * groundTruth(:)' * log(p(:)) + lambda/2 * sum(theta(:) .^ 2);
thetagrad = -1/numCases * (groundTruth - p) * data' + lambda * theta;% ------------------------------------------------------------------
% Unroll the gradient matrices into a vector for minFunc
grad = [thetagrad(:)];
end

softmaxPredict.m

function [pred] = softmaxPredict(softmaxModel, data)% softmaxModel - model trained using softmaxTrain
% data - the N x M input matrix, where each column data(:, i) corresponds to
%        a single test set
%
% Your code should produce the prediction matrix 
% pred, where pred(i) is argmax_c P(y(c) | x(i)).% Unroll the parameters from theta
theta = softmaxModel.optTheta;  % this provides a numClasses x inputSize matrix
pred = zeros(1, size(data, 2));%% ---------- YOUR CODE HERE --------------------------------------
%  Instructions: Compute pred using theta assuming that the labels start 
%                from 1.% t=theta*data;
% [~,pred]=max(t,[],1);
% pred=pred';
[nop, pred] = max(theta * data);%  pred= max(peed_temp);% ---------------------------------------------------------------------end