Understanding the Linear Layer in the PyTorch Deep Learning Framework

Contents

- 1. The mathematical view: $\displaystyle y = W\,x + b$
  - Example
- 2. The implementation view: $\displaystyle y = x\,W^T + b$
- 3. Common mistakes and points of confusion
- 4. Summary
- References

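For the linear layer (Linear Layer / Fully Connected Layer) of a neural network or machine learning model, the formula usually appears in one of two forms:

The form used in mathematical texts and classical linear algebra: $\displaystyle y = W\,x + b$
The form used in much deep learning code: $\displaystyle y = x\,W^T + b$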

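On first encounter the two forms seem to point in different "directions", and it is not obvious how to map one onto the other. The weight shape is also written sometimes as $(\text{in\_features},\, \text{out\_features})$ and sometimes as $(\text{out\_features},\, \text{in\_features})$, which adds to the confusion. This article examines how the two forms relate, first from the mathematical view and then from the implementation view, and uses concrete examples to point out common pitfalls and the index correspondences that need particular care.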

1. The mathematical view: $\displaystyle y = W\,x + b$

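In linear algebra, if the input $x$ is taken to be a column vector, it is usually written $\displaystyle x\in\mathbb{R}^{\text{in\_features}}$ (or, in stricter matrix-shape notation, as having shape $(\text{in\_features},\,1)$). The most common fully connected layer can then be written as: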

$y = W\,x + b,$

where:

$W$ is a matrix of size $\bigl(\text{out\_features},\,\text{in\_features}\bigr)$;
$b$ is an $\text{out\_features}$-dimensional bias vector (shape $(\text{out\_features},\,1)$);
$y$ is the output vector, of size $\text{out\_features}$.

Example

Suppose $\text{in\_features}=3$ and $\text{out\_features}=2$. Then:
$W \in \mathbb{R}^{2\times 3},\quad x \in \mathbb{R}^{3\times 1},\quad b \in \mathbb{R}^{2\times 1}.$

Written out in full, the matrices are:

$W = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\[5pt] w_{21} & w_{22} & w_{23} \end{bmatrix},\quad x = \begin{bmatrix} x_{1}\\ x_{2}\\ x_{3} \end{bmatrix},\quad b = \begin{bmatrix} b_{1}\\ b_{2} \end{bmatrix}.$

The linear transformation $W x + b$ then expands to:

$\begin{aligned} Wx + b &= \begin{bmatrix} w_{11}x_1 + w_{12}x_2 + w_{13}x_3 \\ w_{21}x_1 + w_{22}x_2 + w_{23}x_3 \end{bmatrix} + \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \\ &= \begin{bmatrix} w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1 \\ w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2 \end{bmatrix}. \end{aligned}$

This is the most traditional notation, and the one most common in mathematical texts and linear algebra courses.
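To make the expansion concrete, here is a minimal numerical sketch. The numbers are arbitrary values chosen for illustration and do not come from the article, and NumPy is used only as a convenient way to do the arithmetic.

```python
import numpy as np

# Illustrative values only (an assumption, not taken from the article).
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (out_features, in_features) = (2, 3)
x = np.array([[1.0], [0.0], [-1.0]])   # column vector, shape (3, 1)
b = np.array([[0.5], [-0.5]])          # column vector, shape (2, 1)

y = W @ x + b                          # shape (2, 1)

# The same result written out entry by entry, mirroring the expansion above.
y_manual = np.array([
    [W[0, 0] * x[0, 0] + W[0, 1] * x[1, 0] + W[0, 2] * x[2, 0] + b[0, 0]],
    [W[1, 0] * x[0, 0] + W[1, 1] * x[1, 0] + W[1, 2] * x[2, 0] + b[1, 0]],
])

print(y.shape)                   # (2, 1)
print(np.allclose(y, y_manual))  # True
```

Here the first output entry is $1\cdot1 + 2\cdot0 + 3\cdot(-1) + 0.5 = -1.5$ and the second is $4\cdot1 + 5\cdot0 + 6\cdot(-1) - 0.5 = -2.5$.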

2. The implementation view: $\displaystyle y = x\,W^T + b$

In actual deep learning code (PyTorch being the typical example), what one often sees instead is:

y = x @ W.T + b

Note that W.shape is usually defined as $(\text{out\_features},\, \text{in\_features})$, while for batched input x.shape is $(\text{batch\_size},\, \text{in\_features})$. The result of (x @ W.T) therefore has shape $(\text{batch\_size},\, \text{out\_features})$.

Why does the transpose appear?
In mathematics, $x$ is usually treated as a "column vector" placed on the right, which gives $y = W x + b$.
In code, especially with batched input, x is laid out as "row vectors" with shape $(\text{batch\_size},\, \text{in\_features})$, one sample per row. For the matrix product to be defined, W (of size $(\text{out\_features},\, \text{in\_features})$) has to be transposed to $(\text{in\_features},\, \text{out\_features})$ so that the inner dimensions satisfy the "row × column" matching rule.

In terms of shapes,

$(\text{batch\_size},\, \text{in\_features}) \times (\text{in\_features},\, \text{out\_features}) = (\text{batch\_size},\, \text{out\_features}).$

So the code reads x @ W.T, plus the bias b (which is usually broadcast along the $\text{batch\_size}$ dimension).
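The shape bookkeeping and the bias broadcast can be checked with a short sketch. NumPy is used here only because its `@` and `.T` notation matches the line of code above. The sizes and the random data are assumptions made for illustration.

```python
import numpy as np

# Hypothetical sizes, chosen only for this illustration.
batch_size, in_features, out_features = 4, 3, 2
rng = np.random.default_rng(0)

x = rng.standard_normal((batch_size, in_features))    # one sample per row
W = rng.standard_normal((out_features, in_features))  # same layout as in y = W x + b
b = rng.standard_normal(out_features)                 # shape (out_features,)

y = x @ W.T + b   # (4, 3) @ (3, 2) -> (4, 2); b is broadcast over the 4 rows

# Row i of y should equal W x_i + b for the i-th sample taken on its own.
per_sample = np.stack([W @ x[i] + b for i in range(batch_size)])

print(y.shape)                     # (4, 2)
print(np.allclose(y, per_sample))  # True
```

The comparison with `per_sample` shows that the batched expression is just the column-vector formula applied to each row independently.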

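In essence this does not conflict with the mathematical formula $W\,x + b$; the two differ only by the transpose that relates a "column vector" to a "row vector". Once it is clear what shape the output $y$ should have, the reason for the .T in the code follows.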
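For a single sample, the transpose relation can be written out explicitly. Taking $y = W x + b$ with column vectors and transposing both sides (using $(AB)^\top = B^\top A^\top$) gives:

$\begin{aligned} y^\top &= (W x + b)^\top \\ &= (W x)^\top + b^\top \\ &= x^\top W^\top + b^\top. \end{aligned}$

Here $x^\top$ has shape $(1,\, \text{in\_features})$, $W^\top$ has shape $(\text{in\_features},\, \text{out\_features})$, and $y^\top$ has shape $(1,\, \text{out\_features})$. Stacking $\text{batch\_size}$ such rows on top of one another yields exactly the batched form x @ W.T + b.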

3. Common mistakes and points of confusion

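Some tutorials or documents carelessly write something like "suppose we have a weight matrix $W$ of shape $(\text{in\_features},\text{out\_features})$ ..." and then compute $W x$, expecting an $\text{out\_features}$-dimensional result. In the usual linear algebra convention, however, the number of rows of $W$ must equal the output dimension and the number of columns must equal the input dimension. The correct statement is therefore

$W\in\mathbb{R}^{(\text{out\_features}) \times (\text{in\_features})}.$

Otherwise the dimensions in the matrix product do not line up.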
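The mismatch is easy to reproduce. The sketch below uses NumPy and all-ones arrays with hypothetical sizes. Its only purpose is to show that a weight of shape $(\text{in\_features},\, \text{out\_features})$ cannot be multiplied onto a column vector from the left, while its transpose can.

```python
import numpy as np

# Hypothetical sizes, chosen only for this illustration.
in_features, out_features = 3, 2
x = np.ones((in_features, 1))                   # column vector, shape (3, 1)
W_wrong = np.ones((in_features, out_features))  # shape (3, 2): rows and columns swapped

try:
    W_wrong @ x                                 # (3, 2) @ (3, 1): inner dims 2 != 3
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

W = W_wrong.T                                   # shape (2, 3) = (out_features, in_features)
y = W @ x                                       # shape (2, 1)

print(mismatch_raised)  # True
print(y.shape)          # (2, 1)
```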
This in turn can raise a question: why does torch.nn.Linear(in_features, out_features) report weight.shape == (out_features, in_features)? The reason is the same: that shape is exactly the shape of $W$ used in the mathematical texts above.
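This can be verified directly on a `torch.nn.Linear` module. The following is a small sketch with arbitrarily chosen sizes and random input. It checks the parameter shapes and compares the module's output against the hand-written expression x @ W.T + b.

```python
import torch

torch.manual_seed(0)
# Hypothetical sizes, chosen only for this illustration.
in_features, out_features, batch_size = 3, 2, 4

layer = torch.nn.Linear(in_features, out_features)
print(layer.weight.shape)  # torch.Size([2, 3]), i.e. (out_features, in_features)
print(layer.bias.shape)    # torch.Size([2])

x = torch.randn(batch_size, in_features)
y_layer = layer(x)                          # output computed by the module
y_manual = x @ layer.weight.T + layer.bias  # the hand-written formula

print(y_layer.shape)                                 # torch.Size([4, 2])
print(torch.allclose(y_layer, y_manual, atol=1e-6))  # True
```

The argument order of the constructor, (in_features, out_features), is the opposite of the stored weight's shape, which is a likely source of the confusion described above.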

4. Summary

From the mathematical view:
The most traditional notation is
$y = W\,x + b, \quad W \in \mathbb{R}^{(\text{out\_features})\times(\text{in\_features})},\, x \in \mathbb{R}^{(\text{in\_features})},\, y \in \mathbb{R}^{(\text{out\_features})}.$
From the deep learning code view:
- Batched data is usually treated as row vectors, with each row holding the features of one sample, so its shape is typically $(\text{batch\_size},\, \text{in\_features})$.
- The corresponding weight W is defined with shape $(\text{out\_features},\, \text{in\_features})$. To carry out the row-times-column matrix product, W has to be transposed:
```
y = x @ W.T + b
```
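- The resulting y.shape is $(\text{batch\_size},\, \text{out\_features})$.
Avoiding pitfalls:
- When writing formulas, check carefully where $\text{in\_features}$ and $\text{out\_features}$ appear and in which order the rows and columns of each matrix come.
- In programming practice it is important to understand why the .T is there: it only serves to satisfy the "row × column" rule of matrix multiplication, and the computation is still the same as $y = W x + b$.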