Advanced Optimization Theory and Methods (7)
- Solving Linear Equations
- Case 2
- Theorem
- Kaczmarz's Algorithm
- Theorem
- Example
- Pseudoinverse
- Definition
- Special Case 1
- Special Case 2
- Properties of Pseudoinverse
- Lemma 1: Unique pseudoinverse
- Lemma 2: Full Rank Factorization
- Lemma 3
- Example
- Case 3
- Theorem
- Neural Networks
- Single neuron
- Case 1
- Case 2
- Multiple Layers
- Summary
Solving Linear Equations
Case 2
$Ax=b$, $A\in \mathbb{R}^{m\times n}$, $m\leq n$, $\operatorname{rank} A=m$, $x\in \mathbb{R}^n$, $b\in \mathbb{R}^m$
$\Rightarrow$ infinitely many solutions
$\Rightarrow \min ||x||$
s.t. $Ax=b$
Note: in this case, since there are infinitely many solutions, $Ax=b$ can be regarded as the constraint of an optimization problem, so the problem becomes a constrained optimization problem.
Theorem
Thm: The unique solution $x^*$ of $Ax=b$ that minimizes $||x||$ is given by $x^*=A^T(AA^T)^{-1}b$.
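This formula is easy to check numerically; a minimal NumPy sketch (the matrix `A` and vector `b` below are arbitrary illustrative data):

```python
import numpy as np

# Underdetermined system: m = 2 equations, n = 3 unknowns, rank A = m.
A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([3.0, 4.0])

# Minimum-norm solution x* = A^T (A A^T)^{-1} b.
x_star = A.T @ np.linalg.solve(A @ A.T, b)

print(np.allclose(A @ x_star, b))  # True: x* solves Ax = b
# np.linalg.lstsq returns the minimum-norm solution for underdetermined systems.
print(np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```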
Kaczmarz’s Algorithm
To avoid computing a matrix inverse, we introduce Kaczmarz's algorithm.
1. Set $i=0$ and choose an initial $x^0$.
2. For $j=1,\cdots,m$ do
$$x^{im+j}=x^{im+j-1}+\mu\left(b_j-{a_j}^Tx^{im+j-1}\right)\frac{a_j}{{a_j}^Ta_j}$$
where ${a_j}^T$ denotes the $j$-th row of $A$.
3. $i\leftarrow i+1$, go to step 2.
Note: $0<\mu<2$. Since constrained optimization problems are relatively involved, the convergence rate of algorithms for constrained problems will not be studied in this course. A runnable sketch of the iteration is given below.
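A minimal NumPy sketch of the algorithm as stated (the function name, default sweep count, and stopping by a fixed sweep budget are my own choices, not part of the lecture):

```python
import numpy as np

def kaczmarz(A, b, mu=1.0, x0=None, sweeps=100):
    """Kaczmarz's algorithm for Ax = b with A of full row rank, 0 < mu < 2."""
    m, n = A.shape
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float).copy()
    for _ in range(sweeps):       # step 3: i++, go back to step 2
        for j in range(m):        # step 2: sweep over rows j = 1, ..., m
            a = A[j]              # a_j^T is the j-th row of A
            x = x + mu * (b[j] - a @ x) * a / (a @ a)
    return x
```

Each inner step moves $x$ toward the hyperplane $\{x : {a_j}^Tx=b_j\}$; with $\mu=1$ it is exactly the orthogonal projection onto that hyperplane.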
Theorem
In Kaczmarz’s Algorithm, if $x^0=0$, then $x^k\to x^*=A^T(AA^T)^{-1}b$ as $k\to\infty$.
Example
$$A=\begin{bmatrix} 1&-1 \\ 0&1 \end{bmatrix}, \quad b=\begin{bmatrix} 2 \\ 3 \end{bmatrix}$$
$$\mu=1, \quad x^0=\begin{bmatrix} 0 \\ 0 \end{bmatrix}$$
$$a_1=\begin{bmatrix} 1 \\ -1 \end{bmatrix}, \quad a_2=\begin{bmatrix} 0 \\ 1 \end{bmatrix}$$
$$b_1=2, \quad b_2=3$$
$$i=0, j=1:\ x^1=\begin{bmatrix} 0 \\ 0 \end{bmatrix}+\left(2-[1,-1]\cdot \begin{bmatrix} 0 \\ 0 \end{bmatrix}\right)\frac{\begin{bmatrix} 1 \\ -1 \end{bmatrix}}{[1,-1]\cdot \begin{bmatrix} 1 \\ -1 \end{bmatrix}}=\begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
$$i=0, j=2:\ x^2=\begin{bmatrix} 1 \\ -1 \end{bmatrix}+\left(3-[0,1]\cdot \begin{bmatrix} 1 \\ -1 \end{bmatrix}\right)\frac{\begin{bmatrix} 0 \\ 1 \end{bmatrix}}{[0,1]\cdot \begin{bmatrix} 0 \\ 1 \end{bmatrix}}=\begin{bmatrix} 1 \\ 3 \end{bmatrix}$$
$$i=1, j=1:\ x^3=\begin{bmatrix} 1 \\ 3 \end{bmatrix}+\left(2-[1,-1]\cdot \begin{bmatrix} 1 \\ 3 \end{bmatrix}\right)\frac{\begin{bmatrix} 1 \\ -1 \end{bmatrix}}{[1,-1]\cdot \begin{bmatrix} 1 \\ -1 \end{bmatrix}}=\begin{bmatrix} 3 \\ 1 \end{bmatrix}$$
$$\cdots$$
$$x^*=\begin{bmatrix} 5 \\ 3 \end{bmatrix}$$
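Running the `kaczmarz` sketch from above on this example reproduces the limit:

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [0.0,  1.0]])
b = np.array([2.0, 3.0])
print(kaczmarz(A, b, mu=1.0, sweeps=100))  # approximately [5. 3.]
```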
Pseudoinverse
Definition
Let $A\in \mathbb{R}^{m\times n}$. $A^+\in\mathbb{R}^{n\times m}$ is a pseudoinverse of $A$ if $AA^+A=A$ and $\exists U\in \mathbb{R}^{n\times n}, V\in \mathbb{R}^{m\times m}$ s.t. $A^+=UA^T$ and $A^+=A^TV$.
Note: the pseudoinverse is a generalization of the matrix inverse.
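Numerically, `np.linalg.pinv` computes the pseudoinverse (which is unique; see Lemma 1 below), so the defining identity can be checked directly (arbitrary illustrative matrix):

```python
import numpy as np

A = np.array([[2.0, 1.0, -2.0],
              [1.0, 0.0, -3.0]])      # A in R^{2x3}
A_plus = np.linalg.pinv(A)            # A^+ in R^{3x2}
print(np.allclose(A @ A_plus @ A, A)) # True: A A^+ A = A
```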
Special Case 1
$m\geq n$, $\operatorname{rank} A=n$
$$A^+=(A^TA)^{-1}A^T \Rightarrow AA^+A=A$$
$$U=(A^TA)^{-1}, \quad V=A(A^TA)^{-1}(A^TA)^{-1}A^T$$
$$A^+=UA^T, \quad A^+=A^TV$$
Special Case 2
$m\leq n$, $\operatorname{rank} A=m$
$$A^+=A^T(AA^T)^{-1} \Rightarrow AA^+A=A$$
$$U=A^T(AA^T)^{-1}(AA^T)^{-1}A, \quad V=(AA^T)^{-1}$$
$$A^+=UA^T, \quad A^+=A^TV$$
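Both closed-form expressions can be checked against `np.linalg.pinv` (a minimal sketch; the two full-rank test matrices are arbitrary):

```python
import numpy as np

# Special Case 1: m >= n, rank A = n (full column rank).
A1 = np.array([[1.0, 0.0],
               [2.0, 1.0],
               [0.0, 3.0]])
print(np.allclose(np.linalg.inv(A1.T @ A1) @ A1.T, np.linalg.pinv(A1)))  # True

# Special Case 2: m <= n, rank A = m (full row rank).
A2 = np.array([[1.0, 2.0, 0.0],
               [0.0, 1.0, 1.0]])
print(np.allclose(A2.T @ np.linalg.inv(A2 @ A2.T), np.linalg.pinv(A2)))  # True
```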
Properties of Pseudoinverse
Lemma 1: Unique pseudoinverse
pf: Assume $A_1^+, A_2^+$ are both pseudoinverses of $A$. Then
$$AA_1^+A=AA_2^+A=A$$
and there exist $U_1,U_2\in \mathbb{R}^{n\times n}$, $V_1,V_2\in \mathbb{R}^{m\times m}$ s.t.
$$A_1^+=U_1A^T=A^TV_1, \quad A_2^+=U_2A^T=A^TV_2$$
Let $D=A_2^+-A_1^+$, $U=U_2-U_1$, $V=V_2-V_1$.
Then $O=ADA$ and $D=UA^T=A^TV$.
$$\Rightarrow (DA)^TDA=A^TD^TDA=A^TV^TADA=O \quad (\text{using } D^T=V^TA)$$
$$\Rightarrow DA=O$$
$$\Rightarrow DD^T=DAU^T=O \quad (\text{using } D^T=AU^T) \Rightarrow D=O$$
Lemma 2: Full Rank Factorization
Let $A\in \mathbb{R}^{m\times n}$ with $\operatorname{rank} A=r\leq \min(m,n)$. Then there exist $B\in\mathbb{R}^{m\times r}$ and $C\in\mathbb{R}^{r\times n}$ with $\operatorname{rank} B=\operatorname{rank} C=r$ such that $A=BC$.
Lemma 3
Let $A\in \mathbb{R}^{m\times n}$ have the full rank factorization $A=BC$. Then $A^+=C^+B^+$.
Example
$$A=\begin{bmatrix} 2&1&-2&5 \\ 1&0&-3&2 \\ 3&-1&-13&5 \end{bmatrix}$$
$\operatorname{rank} A=2$
$$B=\begin{bmatrix} 2&1 \\ 1&0 \\ 3&-1 \end{bmatrix}, \quad C=\begin{bmatrix} 1&0&-3&2 \\ 0&1&4&1 \end{bmatrix}$$
$$A=BC, \quad A^+=C^+B^+$$
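Here $B$ has full column rank and $C$ has full row rank, so Special Cases 1 and 2 give $B^+$ and $C^+$ in closed form, and Lemma 3 can be verified numerically:

```python
import numpy as np

A = np.array([[2.0,  1.0,  -2.0, 5.0],
              [1.0,  0.0,  -3.0, 2.0],
              [3.0, -1.0, -13.0, 5.0]])
B = np.array([[2.0,  1.0],
              [1.0,  0.0],
              [3.0, -1.0]])
C = np.array([[1.0, 0.0, -3.0, 2.0],
              [0.0, 1.0,  4.0, 1.0]])

print(np.allclose(B @ C, A))                            # True: A = BC
B_plus = np.linalg.inv(B.T @ B) @ B.T                   # Special Case 1
C_plus = C.T @ np.linalg.inv(C @ C.T)                   # Special Case 2
print(np.allclose(C_plus @ B_plus, np.linalg.pinv(A)))  # True: A^+ = C^+ B^+
```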
Case 3
$Ax=b$, $A\in \mathbb{R}^{m\times n}$, $\operatorname{rank} A\leq \min(m,n)$
① $x$ minimizes $||Ax-b||^2$
② among all such minimizers, $x$ minimizes $||x||$
Note: the main difference between Case 1 and Case 2 is the relation between $m$ and $n$; in both, $A$ has full rank. Case 3 covers the more general situation in which $A$ need not have full rank. When $m=n$, both Case 1 and Case 2 apply; when $A$ has full rank, Case 3 also applies. So the cases are not a strict, mutually exclusive classification; each result is stated over the largest range on which it holds.
Theorem
Given $Ax=b$ with $\operatorname{rank} A=r$, the vector $x^*=A^+b$ minimizes $||Ax-b||^2$. Furthermore, among all vectors that minimize $||Ax-b||^2$, $x^*$ is the unique one with minimal norm.
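`np.linalg.lstsq` returns exactly this minimum-norm least-squares solution, so it should agree with $A^+b$ even when $A$ is rank-deficient (arbitrary test data):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])        # rank A = 1 < min(3, 2)
b = np.array([1.0, 0.0, 1.0])

x_star = np.linalg.pinv(A) @ b
print(np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```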
Neural Networks
Single neuron
Task: data fitting
Training data: $\langle x^d,y^d\rangle$, $x^d\in \mathbb{R}^{n\times s}$, $y^d\in\mathbb{R}^s$, where each of the $s$ columns of $x^d$ is one data point.
$$f(x)=\sum_{i=1}^n \omega_ix_i=\omega^Tx$$
Find $\omega\in\mathbb{R}^n$ minimizing $\frac{1}{2}\sum_{i=1}^s \left(y_i^d-{x_i^d}^T\omega\right)^2=\frac{1}{2}||y^d-{x^d}^T\omega||^2$
Case 1
$\operatorname{rank} x^d=s\leq n$
$\Rightarrow \exists$ infinitely many $\omega$ with $y^d={x^d}^T\omega$
$\Rightarrow \min ||\omega||$
s.t. $y^d={x^d}^T\omega$
$\Rightarrow \omega^*=x^d({x^d}^Tx^d)^{-1}y^d$
$\Rightarrow$ Kaczmarz's Algorithm (applied with $A={x^d}^T$, $b=y^d$)
Case 2
$\operatorname{rank} x^d=n\leq s$
$\Rightarrow \omega^*=({x^d}^Tx^d)^{-1}{x^d}^Ty^d$
$\Rightarrow$ Gradient algorithm
$$\omega^{k+1}=\omega^k-\alpha^k \nabla f(\omega^k)$$
$$\Rightarrow \omega^{k+1}=\omega^k+\alpha^kx^de^k, \quad \text{where } e^k=y^d-{x^d}^T\omega^k$$
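A minimal sketch of this gradient iteration (the fixed step size, random seed, and synthetic data are my own choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 3, 20                       # n features, s samples, s >= n
Xd = rng.normal(size=(n, s))       # x^d: each column is one data point
w_true = np.array([1.0, -2.0, 0.5])
yd = Xd.T @ w_true                 # consistent targets y^d = x^{dT} w_true

w = np.zeros(n)
alpha = 0.01                       # assumed small enough for convergence
for _ in range(2000):
    e = yd - Xd.T @ w              # e^k = y^d - x^{dT} w^k
    w = w + alpha * (Xd @ e)       # w^{k+1} = w^k + alpha^k x^d e^k
print(np.allclose(w, w_true))      # True: recovers w_true
```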
$$y=f\left(\sum_{i=1}^n \omega_ix_i\right) \Rightarrow \min\frac{1}{2}\left|\left|y^d-f\left(\sum_{i=1}^n \omega_ix_i\right)\right|\right|^2$$
$$\omega^{k+1}=\omega^k+\alpha^k\frac{x^de^k}{||x^d||^2}$$
[here there is only a single training pair $\langle x^d,y^d\rangle$]
$$e^k=y^d-f({x^d}^T\omega^k)$$
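A sketch of this single-sample rule with a concrete activation (the choice $f=\tanh$ and the data are assumptions for illustration; note the rule uses the raw error $e^k$, whereas the exact gradient of the objective would carry an extra factor $f'({x^d}^T\omega^k)$):

```python
import numpy as np

f = np.tanh                          # assumed activation; the lecture keeps f generic

xd = np.array([1.0, 2.0, -1.0])      # the single training pair <x^d, y^d>
yd = 0.5                             # must lie in the range of f

w = np.zeros(3)
alpha = 1.0
for _ in range(100):
    e = yd - f(xd @ w)               # e^k = y^d - f(x^{dT} w^k)
    w += alpha * e * xd / (xd @ xd)  # w^{k+1} = w^k + alpha^k x^d e^k / ||x^d||^2
print(np.allclose(f(xd @ w), yd))    # True: the neuron fits the single sample
```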
Multiple Layers
backpropagation algorithm
Since the multilayer case is considerably more involved, it is not expanded here.
Summary
The previous lecture covered the first case of solving linear equations; this lecture covered the second and third cases. To make the results more general, the notion of the matrix pseudoinverse was also introduced. We then began discussing neural networks, making some mathematical simplifications to ease the theoretical analysis. The focus was the simplest setting of a single neuron. The backpropagation algorithm for multilayer networks was mentioned, but since it is rather involved, no detailed computation was carried out.