Gradients of Matrix-Matrix Multiplication in Deep Learning

Understanding Artificial Neural Networks with Hands-on Experience - Part 1. Matrix Multiplication, Its Gradients and Custom Implementations
https://coolgpu.github.io/coolgpu_blog/github/pages/2020/09/22/matrixmultiplication.html

1. Matrix multiplication

The definition of matrix multiplication can be found in every linear algebra book. Let's use the definition from Wikipedia. Given an $m \times k$ matrix $\boldsymbol{A}$ and a $k \times n$ matrix $\boldsymbol{B}$

$$\boldsymbol{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mk} \end{bmatrix} \tag{1}$$

and

$$\boldsymbol{B}=\begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{k1} & b_{k2} & \cdots & b_{kn} \end{bmatrix} \tag{1}$$

their matrix product $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$ is defined as

$$\boldsymbol{C}=\begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mn} \end{bmatrix} \tag{2}$$

where its element $c_{ij}$ is given by

$$c_{ij} = \sum_{t=1}^{k} a_{it} b_{tj} \tag{3}$$

for $i = 1, \ldots, m$ and $j = 1, \ldots, n$. In other words, $c_{ij}$ is the dot product of the $i$th row of $\boldsymbol{A}$ and the $j$th column of $\boldsymbol{B}$.
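Equation (3) translates almost literally into code. The following sanity-check sketch (not from the original post) implements the triple-loop definition in NumPy and compares it against the built-in matrix product; the $2 \times 4$ and $4 \times 3$ shapes match the example used later in Section 2.3.

```python
import numpy as np

def matmul_naive(A, B):
    """Compute C = A @ B directly from the definition c_ij = sum_t a_it * b_tj."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            for t in range(k):
                C[i, j] += A[i, t] * B[t, j]
    return C

A = np.random.randn(2, 4)
B = np.random.randn(4, 3)
print(np.allclose(matmul_naive(A, B), A @ B))  # expected: True
```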

2. Derivation of the gradients

2.1. Dimensions of the gradients

If we were considering an isolated matrix multiplication, the partial derivative of matrix $\boldsymbol{C}$ with respect to either matrix $\boldsymbol{A}$ or matrix $\boldsymbol{B}$ would be a 4-D array, often referred to as a Jacobian. You will also find that this 4-D Jacobian contains many zeros because, as shown in Equation (3), $c_{ij}$ is a function of only the $i$th row of $\boldsymbol{A}$ and the $j$th column of $\boldsymbol{B}$, and is independent of the other rows of $\boldsymbol{A}$ and the other columns of $\boldsymbol{B}$.

Jacobian matrix and determinant
https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant

What we are considering here is not an isolated matrix multiplication. Instead, we are talking about matrix multiplication inside a neural network that has a scalar loss function. For example, consider a simple case where the loss $L$ is the mean of matrix $\boldsymbol{C}$:

$$L = \frac{1}{m \times n} \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} \tag{4}$$

Our focus is the partial derivatives of the scalar $L$ w.r.t. the input matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, i.e., $\frac{\partial L}{\partial \boldsymbol{A}}$ and $\frac{\partial L}{\partial \boldsymbol{B}}$, respectively. Therefore, $\frac{\partial L}{\partial \boldsymbol{A}}$ has the same dimension as $\boldsymbol{A}$, which is another $m \times k$ matrix, and $\frac{\partial L}{\partial \boldsymbol{B}}$ has the same dimension as $\boldsymbol{B}$, which is another $k \times n$ matrix.
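For the specific loss in Equation (4), the upstream gradient $\frac{\partial L}{\partial \boldsymbol{C}}$ can be written down immediately, since every element $c_{ij}$ contributes to $L$ with the same constant weight:

$$\frac{\partial L}{\partial c_{ij}} = \frac{1}{m \times n} \quad \text{for all } i, j$$

so $\frac{\partial L}{\partial \boldsymbol{C}}$ is an $m \times n$ matrix with every entry equal to $\frac{1}{m \times n}$. This is the upstream derivative that gets passed into the matrix-multiplication node during backpropagation in the examples below.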

2.2. The chain rule

We will use the chain rule to perform backpropagation of gradients. For such an important tool in neural networks, it is worth briefly summarizing the chain rule one more time, as in the previous post. Given a function $L\left(x_1, x_2, \ldots, x_N\right)$ of the form

$$L\left(x_1, \ldots, x_N\right) = L\left(f_1\left(x_1, \ldots, x_N\right), f_2\left(x_1, \ldots, x_N\right), \ldots, f_M\left(x_1, \ldots, x_N\right)\right) \tag{5}$$

the gradient of $L$ w.r.t. $x_i$ can be computed as

$$\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial f_1}\frac{\partial f_1}{\partial x_i} + \frac{\partial L}{\partial f_2}\frac{\partial f_2}{\partial x_i} + \cdots + \frac{\partial L}{\partial f_M}\frac{\partial f_M}{\partial x_i} = \sum_{m=1}^{M} \frac{\partial L}{\partial f_m}\frac{\partial f_m}{\partial x_i} \tag{6}$$

Equation (6) can be understood from two perspectives:

  • Summation means that all possible paths through which $x_i$ contributes to $L$ must be included.
  • Product means that, along each path $m$, the output gradient equals the upstream gradient passed in, $\frac{\partial L}{\partial f_m}$, times the local gradient, $\frac{\partial f_m}{\partial x_i}$.
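
To make Equation (6) concrete, here is a minimal numerical check with $M = 2$ paths. The intermediate functions $f_1 = x^2$, $f_2 = \sin(x)$ and the loss $L = f_1 f_2$ are made-up choices purely for illustration; the chain-rule sum is compared against a finite-difference estimate.

```python
import math

# Two intermediate quantities (made-up example): f1 = x^2, f2 = sin(x), L = f1 * f2
x = 0.7
f1, f2 = x**2, math.sin(x)
L = f1 * f2

# Chain rule (Equation 6): dL/dx = (dL/df1)(df1/dx) + (dL/df2)(df2/dx)
dL_df1, dL_df2 = f2, f1              # upstream gradients of L w.r.t. f1 and f2
df1_dx, df2_dx = 2 * x, math.cos(x)  # local gradients of f1 and f2 w.r.t. x
dL_dx = dL_df1 * df1_dx + dL_df2 * df2_dx

# Finite-difference check
h = 1e-6
L_h = (x + h) ** 2 * math.sin(x + h)
print(dL_dx, (L_h - L) / h)          # the two numbers should agree to about 1e-5
```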

2.3. Derivation of the gradient $\frac{\partial L}{\partial \boldsymbol{A}}$

In this section, we will use a $2 \times 4$ matrix $\boldsymbol{A}$ and a $4 \times 3$ matrix $\boldsymbol{B}$ as an example to derive, step by step, the partial derivative $\frac{\partial L}{\partial \boldsymbol{A}}$. Please note that the same derivation can be performed on a general $m \times k$ matrix $\boldsymbol{A}$ and $k \times n$ matrix $\boldsymbol{B}$. A specific example is used here purely to make the steps more straightforward.

Let's start by writing the matrices $\boldsymbol{A}$, $\boldsymbol{B}$ and their matrix product $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$ in expanded form.

$$\boldsymbol{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & {\color{red} a_{23}} & a_{24} \end{bmatrix} \tag{7}$$

and

$$\boldsymbol{B} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \\ b_{41} & b_{42} & b_{43} \end{bmatrix} \tag{7}$$

$$\begin{aligned} \boldsymbol{C} &= \begin{bmatrix} c_{11} & c_{12} & c_{13} \\ c_{21} & c_{22} & c_{23} \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & {\color{red} a_{23}} & a_{24} \end{bmatrix} \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \\ b_{41} & b_{42} & b_{43} \end{bmatrix} \\ &= \begin{bmatrix} a_{11}b_{11} + a_{12}b_{21} + a_{13}b_{31} + a_{14}b_{41} & a_{11}b_{12} + a_{12}b_{22} + a_{13}b_{32} + a_{14}b_{42} & a_{11}b_{13} + a_{12}b_{23} + a_{13}b_{33} + a_{14}b_{43} \\ a_{21}b_{11} + a_{22}b_{21} + {\color{red} a_{23}}b_{31} + a_{24}b_{41} & a_{21}b_{12} + a_{22}b_{22} + {\color{red} a_{23}}b_{32} + a_{24}b_{42} & a_{21}b_{13} + a_{22}b_{23} + {\color{red} a_{23}}b_{33} + a_{24}b_{43} \end{bmatrix} \end{aligned} \tag{8}$$

Consider an arbitrary element of $\boldsymbol{A}$, for example ${\color{red} a_{23}}$. The chain rule tells us that $\frac{\partial L}{\partial \boldsymbol{A}} = \frac{\partial L}{\partial \boldsymbol{C}} \frac{\partial \boldsymbol{C}}{\partial \boldsymbol{A}}$, so the first ingredient we need is the set of local partial derivatives of the elements of $\boldsymbol{C}$ w.r.t. ${\color{red} a_{23}}$, which follow directly from Equation (8):

$$\begin{aligned} \frac{\partial c_{11}}{\partial {\color{red} a_{23}}} &= 0 \\ \frac{\partial c_{12}}{\partial {\color{red} a_{23}}} &= 0 \\ \frac{\partial c_{13}}{\partial {\color{red} a_{23}}} &= 0 \\ \frac{\partial c_{21}}{\partial {\color{red} a_{23}}} &= \frac{\partial}{\partial {\color{red} a_{23}}}\left( a_{21}b_{11} + a_{22}b_{21} + {\color{red} a_{23}}b_{31} + a_{24}b_{41} \right) = 0 + 0 + \frac{\partial}{\partial {\color{red} a_{23}}}\left( {\color{red} a_{23}}b_{31} \right) + 0 = b_{31} \\ \frac{\partial c_{22}}{\partial {\color{red} a_{23}}} &= \frac{\partial}{\partial {\color{red} a_{23}}}\left( a_{21}b_{12} + a_{22}b_{22} + {\color{red} a_{23}}b_{32} + a_{24}b_{42} \right) = 0 + 0 + \frac{\partial}{\partial {\color{red} a_{23}}}\left( {\color{red} a_{23}}b_{32} \right) + 0 = b_{32} \\ \frac{\partial c_{23}}{\partial {\color{red} a_{23}}} &= \frac{\partial}{\partial {\color{red} a_{23}}}\left( a_{21}b_{13} + a_{22}b_{23} + {\color{red} a_{23}}b_{33} + a_{24}b_{43} \right) = 0 + 0 + \frac{\partial}{\partial {\color{red} a_{23}}}\left( {\color{red} a_{23}}b_{33} \right) + 0 = b_{33} \end{aligned} \tag{9}$$

Using the chain rule, we have the partial derivative of the loss $L$ w.r.t. ${\color{red} a_{23}}$:

$$\begin{aligned} \frac{\partial L}{\partial {\color{red} a_{23}}} &= \frac{\partial L}{\partial c_{11}}\frac{\partial c_{11}}{\partial {\color{red} a_{23}}} + \frac{\partial L}{\partial c_{12}}\frac{\partial c_{12}}{\partial {\color{red} a_{23}}} + \frac{\partial L}{\partial c_{13}}\frac{\partial c_{13}}{\partial {\color{red} a_{23}}} + \frac{\partial L}{\partial c_{21}}\frac{\partial c_{21}}{\partial {\color{red} a_{23}}} + \frac{\partial L}{\partial c_{22}}\frac{\partial c_{22}}{\partial {\color{red} a_{23}}} + \frac{\partial L}{\partial c_{23}}\frac{\partial c_{23}}{\partial {\color{red} a_{23}}} \\ &= 0 + 0 + 0 + \frac{\partial L}{\partial c_{21}}b_{31} + \frac{\partial L}{\partial c_{22}}b_{32} + \frac{\partial L}{\partial c_{23}}b_{33} \\ &= \frac{\partial L}{\partial c_{21}}b_{31} + \frac{\partial L}{\partial c_{22}}b_{32} + \frac{\partial L}{\partial c_{23}}b_{33} \end{aligned} \tag{10}$$

The second line in Equation (10) used the results from Equation (9).
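
Equation (10) can also be verified numerically: perturb $a_{23}$ by a small step $h$, recompute $L$ as in Equation (4), and compare the finite-difference slope with the formula. A minimal NumPy sketch with random matrices (not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 4))
B = rng.standard_normal((4, 3))

def loss(A, B):
    # L from Equation (4): the mean of C = AB, with m = 2 and n = 3
    return (A @ B).mean()

# Upstream gradient dL/dC for the mean loss: every entry equals 1/(m*n) = 1/6
dL_dC = np.full((2, 3), 1.0 / 6.0)

# Equation (10): dL/da23 = (dL/dc21) b31 + (dL/dc22) b32 + (dL/dc23) b33
analytic = dL_dC[1, 0] * B[2, 0] + dL_dC[1, 1] * B[2, 1] + dL_dC[1, 2] * B[2, 2]

# Finite-difference check on a23 (zero-based index [1, 2])
h = 1e-6
A_h = A.copy()
A_h[1, 2] += h
numeric = (loss(A_h, B) - loss(A, B)) / h

print(analytic, numeric)  # should agree to roughly 1e-6
```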

In a similar manner, we can derive the other elements of $\frac{\partial L}{\partial \boldsymbol{A}}$ as below:

$$\begin{aligned} \frac{\partial L}{\partial \boldsymbol{A}} &= \begin{bmatrix} \frac{\partial L}{\partial a_{11}} & \frac{\partial L}{\partial a_{12}} & \frac{\partial L}{\partial a_{13}} & \frac{\partial L}{\partial a_{14}} \\ \frac{\partial L}{\partial a_{21}} & \frac{\partial L}{\partial a_{22}} & \frac{\partial L}{\partial a_{23}} & \frac{\partial L}{\partial a_{24}} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\partial L}{\partial c_{11}}b_{11} + \frac{\partial L}{\partial c_{12}}b_{12} + \frac{\partial L}{\partial c_{13}}b_{13} & \frac{\partial L}{\partial c_{11}}b_{21} + \frac{\partial L}{\partial c_{12}}b_{22} + \frac{\partial L}{\partial c_{13}}b_{23} & \frac{\partial L}{\partial c_{11}}b_{31} + \frac{\partial L}{\partial c_{12}}b_{32} + \frac{\partial L}{\partial c_{13}}b_{33} & \frac{\partial L}{\partial c_{11}}b_{41} + \frac{\partial L}{\partial c_{12}}b_{42} + \frac{\partial L}{\partial c_{13}}b_{43} \\ \frac{\partial L}{\partial c_{21}}b_{11} + \frac{\partial L}{\partial c_{22}}b_{12} + \frac{\partial L}{\partial c_{23}}b_{13} & \frac{\partial L}{\partial c_{21}}b_{21} + \frac{\partial L}{\partial c_{22}}b_{22} + \frac{\partial L}{\partial c_{23}}b_{23} & \frac{\partial L}{\partial c_{21}}b_{31} + \frac{\partial L}{\partial c_{22}}b_{32} + \frac{\partial L}{\partial c_{23}}b_{33} & \frac{\partial L}{\partial c_{21}}b_{41} + \frac{\partial L}{\partial c_{22}}b_{42} + \frac{\partial L}{\partial c_{23}}b_{43} \end{bmatrix} \end{aligned} \tag{11}$$

Equation (11) can be equivalently rewritten as a matrix product.

$$\frac{\partial L}{\partial \boldsymbol{A}} = \begin{bmatrix} \frac{\partial L}{\partial c_{11}} & \frac{\partial L}{\partial c_{12}} & \frac{\partial L}{\partial c_{13}} \\ \frac{\partial L}{\partial c_{21}} & \frac{\partial L}{\partial c_{22}} & \frac{\partial L}{\partial c_{23}} \end{bmatrix} \begin{bmatrix} b_{11} & b_{21} & b_{31} & b_{41} \\ b_{12} & b_{22} & b_{32} & b_{42} \\ b_{13} & b_{23} & b_{33} & b_{43} \end{bmatrix} \tag{12}$$

In fact, the first matrix is the upstream derivative $\frac{\partial L}{\partial \boldsymbol{C}}$ and the second matrix is the transpose of $\boldsymbol{B}$. Then we have

$$\frac{\partial L}{\partial \boldsymbol{A}} = \frac{\partial L}{\partial \boldsymbol{C}} \boldsymbol{B}^{T} \tag{13}$$

Equation (13) shows that, for a matrix multiplication $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{B}$ in a neural network, the derivative of the loss $L$ w.r.t. matrix $\boldsymbol{A}$ equals the upstream derivative $\frac{\partial L}{\partial \boldsymbol{C}}$ times the transpose of matrix $\boldsymbol{B}$.

Let's check the dimensions. On the left-hand side of Equation (13), $\frac{\partial L}{\partial \boldsymbol{A}}$ has a dimension of $m \times k$, the same as $\boldsymbol{A}$. On the right-hand side, $\frac{\partial L}{\partial \boldsymbol{C}}$ has a dimension of $m \times n$ and $\boldsymbol{B}^T$ has a dimension of $n \times k$; therefore, their matrix product has a dimension of $m \times k$, matching that of $\frac{\partial L}{\partial \boldsymbol{A}}$.

2.4. Derivation of the gradient $\frac{\partial L}{\partial \boldsymbol{B}}$
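
The full element-by-element derivation of $\frac{\partial L}{\partial \boldsymbol{B}}$ is not repeated here because it mirrors Section 2.3 step by step; the following is only a brief sketch of the symmetric result. An arbitrary element $b_{ij}$ appears, by Equation (3), only in the $j$th column of $\boldsymbol{C}$, each time multiplied by an element of the $i$th column of $\boldsymbol{A}$:

$$\frac{\partial L}{\partial b_{ij}} = \sum_{p=1}^{m} \frac{\partial L}{\partial c_{pj}} a_{pi}$$

Collecting all elements into matrix form gives

$$\frac{\partial L}{\partial \boldsymbol{B}} = \boldsymbol{A}^{T} \frac{\partial L}{\partial \boldsymbol{C}}$$

The dimensions also check out: $\boldsymbol{A}^T$ is $k \times m$ and $\frac{\partial L}{\partial \boldsymbol{C}}$ is $m \times n$, so their product is $k \times n$, the same as $\boldsymbol{B}$.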

3. Custom implementations and validation
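
The original post validates the derived gradients with a custom implementation; that code is not reproduced here, so the snippet below is a minimal sketch of such a validation, assuming PyTorch is available. It wraps the forward pass and the two gradient formulas (Equation (13) and the Section 2.4 result) in a custom `torch.autograd.Function` and compares its gradients with PyTorch's built-in autograd. The class name `MyMatmul` and the mean loss from Equation (4) are illustrative choices, not the original post's.

```python
import torch

class MyMatmul(torch.autograd.Function):
    """Custom matmul with hand-written gradients from Section 2."""

    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)
        return A @ B                      # C = AB, Equations (2)/(3)

    @staticmethod
    def backward(ctx, dL_dC):
        A, B = ctx.saved_tensors
        dL_dA = dL_dC @ B.T               # Equation (13)
        dL_dB = A.T @ dL_dC               # symmetric result from Section 2.4
        return dL_dA, dL_dB

# Validation against built-in autograd on the 2x4 and 4x3 example shapes
A = torch.randn(2, 4, dtype=torch.float64, requires_grad=True)
B = torch.randn(4, 3, dtype=torch.float64, requires_grad=True)

# Reference gradients from PyTorch's own matmul
loss_ref = (A @ B).mean()
gA_ref, gB_ref = torch.autograd.grad(loss_ref, (A, B))

# Gradients from the custom implementation
loss_custom = MyMatmul.apply(A, B).mean()
gA, gB = torch.autograd.grad(loss_custom, (A, B))

print(torch.allclose(gA, gA_ref), torch.allclose(gB, gB_ref))  # expected: True True
```

If the derivation is consistent, both comparisons print `True`: the hand-derived formulas reproduce the framework's gradients up to floating-point tolerance.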

4. Summary

In this post, we demonstrated how to derive the gradients of matrix multiplication in neural networks. While the derivation steps may seem complex, the final gradient equations are quite simple and easy to implement:

$$\frac{\partial L}{\partial \boldsymbol{A}} = \frac{\partial L}{\partial \boldsymbol{C}} \boldsymbol{B}^{T}$$

$$\frac{\partial L}{\partial \boldsymbol{B}} = \boldsymbol{A}^{T} \frac{\partial L}{\partial \boldsymbol{C}}$$

In real neural network applications, the matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ typically come from the outputs of other layers. In those scenarios, the gradients $\frac{\partial L}{\partial \boldsymbol{A}}$ and $\frac{\partial L}{\partial \boldsymbol{B}}$ serve as the upstream gradients of those layers during backpropagation.

References

[1] Yongqiang Cheng, https://yongqiang.blog.csdn.net/
[2] Understanding Artificial Neural Networks with Hands-on Experience - Part 1. Matrix Multiplication, Its Gradients and Custom Implementations, https://coolgpu.github.io/coolgpu_blog/github/pages/2020/09/22/matrixmultiplication.html
