【Machine Learning】Supervised Learning

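These notes are based on the supervised learning part of the lecture notes for Tsinghua University's Machine Learning course. They are essentially a cheat sheet I put together a day or two before the exam: broad in coverage but light on detail, and meant mainly for review and memorization.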

Linear Regression

Perceptron

$f(x)=sign(w^\top x+b)$
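Convergence: on linearly separable data with margin $\gamma$ and $\|x_i\|\le R$ (bias absorbed into $w$ ), the perceptron makes at most $(R/\gamma)^2$ mistakes.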
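A minimal sketch of the mistake-driven perceptron update for $f(x)=sign(w^\top x+b)$ . The function names and the toy data are made up for illustration; the loop stops after a full pass with no mistakes, which only happens when the data is linearly separable.

```python
def train_perceptron(xs, ys, max_epochs=100):
    """Mistake-driven perceptron; ys are +1/-1 labels."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or exactly on the boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full clean pass: stop
            break
    return w, b


def predict(w, b, x):
    """sign(w^T x + b), mapping a zero score to -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy linearly separable data (illustrative only).
xs = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-2.0, -1.0)]
ys = [1, 1, -1, -1]
w, b = train_perceptron(xs, ys)
```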

Logistic Regression

Outputs probabilities instead of labels.
Loss: cross entropy $XE(y,p)=-\sum_iy_i\log p_i$ , where $y_i$ is the true probability of class $i$ and $p_i$ is the predicted one.
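A small sketch of the cross-entropy loss in the binary case. It assumes the usual logistic model, where the probability of class 1 is the sigmoid of a linear score; the helper names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cross_entropy(y, p):
    """XE(y, p) = -sum_i y_i log p_i for two distributions over classes."""
    # Terms with y_i = 0 contribute nothing and are skipped to avoid log(0).
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p) if yi > 0)


# A score of 0 gives probability 0.5 for class 1.
p1 = sigmoid(0.0)
# The true class is 1, so y = (0, 1) over the classes (0, 1).
loss = cross_entropy([0.0, 1.0], [1.0 - p1, p1])
```

With a one-hot $y$ , the sum reduces to $-\log$ of the probability assigned to the true class.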

Ridge Regression

$\ell_2$ regularization: add $\frac{\lambda}{2}\|w\|^2_2$ to the loss.
A gradient step on this penalty shrinks every coordinate: $w'=w\cdot (1-\eta \lambda)$
- This is weight decay.
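The shrinkage can be checked with a one-line update. This sketch assumes plain gradient descent with step size `eta` on a loss plus the $\frac{\lambda}{2}\|w\|_2^2$ penalty; `ridge_step` is a hypothetical helper name.

```python
def ridge_step(w, grad_loss, eta, lam):
    """One gradient step on loss(w) + (lam / 2) * ||w||_2^2."""
    # The penalty's gradient is lam * w, so
    # w - eta * (grad_loss + lam * w) = (1 - eta * lam) * w - eta * grad_loss.
    return [(1.0 - eta * lam) * wi - eta * gi for wi, gi in zip(w, grad_loss)]


w = [1.0, -2.0]
# With a zero loss gradient only the decay factor (1 - eta * lam) = 0.95 acts.
w_decayed = ridge_step(w, grad_loss=[0.0, 0.0], eta=0.1, lam=0.5)
```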

LASSO Regression

Find sparse features: we want $\|w\|_0\le c$ .
$\ell_1$ regularization $\lambda\|w\|_1$ (a convex surrogate for the $\ell_0$ constraint).
Gradient: add or subtract $\eta \lambda$ on each coordinate to pull $w$ toward $0$ .
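A sketch of the $\ell_1$ pull toward zero. Note one assumption beyond the notes: coordinates smaller than the step are clipped to exactly $0$ (the soft-thresholding variant) rather than allowed to overshoot, which is what makes the result sparse. The name `l1_shrink` is illustrative.

```python
def l1_shrink(w, eta, lam):
    """Move each coordinate toward 0 by eta * lam, stopping at 0."""
    step = eta * lam
    out = []
    for wi in w:
        if wi > step:
            out.append(wi - step)
        elif wi < -step:
            out.append(wi + step)
        else:
            out.append(0.0)  # small coordinates become exactly zero
    return out


w_sparse = l1_shrink([1.0, -0.03, 0.2, -0.5], eta=0.1, lam=0.5)
```

Unlike the ridge update, which rescales every coordinate, this update subtracts a constant amount, so small weights reach zero in finitely many steps.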

Compressed Sensing*

Compared with LASSO: here we pick $X_{train}$ , observe $Y_{train}$ , and learn $w$ , instead of being given $X_{train},Y_{train}$ .
RIP condition: a matrix $A$ is $(\epsilon,s)$ -RIP if $\|Ax\|\approx \|x\|$ (up to a $1\pm\epsilon$ factor) for every $s$ -sparse $x$ .
Application: reconstructing $x$ from measurements taken with a RIP matrix is easy.
- For sparse $x$
- For sparse $x$ + some noise
- Lemma: RIP implies almost orthogonality
- Proof*
How to find a RIP matrix: sample it at random.
Non-linear sparse: $X = G (Z)$ , $Z$ is sparse, $G$ is non-linear.
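A rough numerical illustration of the norm-preservation idea behind RIP. It assumes an i.i.d. Gaussian matrix scaled by $1/\sqrt{m}$ (one common random construction, not necessarily the one used in the course) and checks a single sparse vector only; RIP itself is a statement about all $s$ -sparse vectors at once. The sizes are arbitrary.

```python
import math
import random

random.seed(0)
m, n, s = 200, 400, 5  # measurements, ambient dimension, sparsity

# Random Gaussian matrix, scaled so that E||Ax||^2 = ||x||^2.
A = [[random.gauss(0.0, 1.0) / math.sqrt(m) for _ in range(n)] for _ in range(m)]

# An s-sparse vector: s random coordinates set to +1 or -1.
support = random.sample(range(n), s)
x = [0.0] * n
for j in support:
    x[j] = random.choice([-1.0, 1.0])

# Only the support contributes to Ax.
Ax = [sum(row[j] * x[j] for j in support) for row in A]
norm_x = math.sqrt(sum(v * v for v in x))
norm_Ax = math.sqrt(sum(v * v for v in Ax))
ratio = norm_Ax / norm_x  # close to 1 when A roughly preserves the norm
```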

SVM

Minimize $\frac{1}{2}\|w\|_2^2+\lambda \sum_{i}\xi_i$ .
- Soft-margin constraints: $y_iw^\top x_i\ge 1-\xi_i,\ \xi_i\ge 0$ .
- Minimizing $\|w\|_2$ maximizes the margin $1/\|w\|_2$ .
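A sketch that evaluates the soft-margin objective for a fixed $w$ . It assumes the common $\frac{1}{2}\|w\|_2^2$ form of the norm term and uses the fact that, at the optimum, each slack equals $\xi_i=\max(0,\,1-y_iw^\top x_i)$ . The data points are made up.

```python
def svm_objective(w, xs, ys, lam):
    """Return 0.5 * ||w||^2 + lam * sum_i xi_i and the slacks xi_i."""
    # Smallest slack satisfying y_i w^T x_i >= 1 - xi_i and xi_i >= 0.
    slacks = [
        max(0.0, 1.0 - y * sum(wi * xi for wi, xi in zip(w, x)))
        for x, y in zip(xs, ys)
    ]
    value = 0.5 * sum(wi * wi for wi in w) + lam * sum(slacks)
    return value, slacks


xs = [(2.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
ys = [1, 1, -1]
value, slacks = svm_objective([1.0, 0.0], xs, ys, lam=1.0)
```

Only the second point falls inside the margin, so it is the only one with a nonzero slack.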

Dual derivation

Primal problem
$\min_x x^2\ \text{ s.t. }\ x\ge b$
Lagrangian $L(x,\alpha)=x^2-\alpha(x-b)$ with $\alpha\ge 0$
- Any feasible $x$ of the primal problem (with $x\ge b$ ) satisfies $L(x,\alpha)\le x^2$ . Thus $d(\alpha)=\min_x L(x,\alpha)$ is a lower bound on the primal optimum $p^*$ .
- Dual: we want the largest lower bound $d^*=\max_{\alpha\ge 0} d(\alpha)=\max_{\alpha\ge 0} \min_xL(x,\alpha)$ .
- Strong duality: $p^*=d^*$
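The toy problem is small enough to check numerically. In this sketch the inner minimum of $L(x,\alpha)=x^2-\alpha(x-b)$ is taken in closed form at $x=\alpha/2$ , and the outer maximum is approximated on a grid of $\alpha$ values; the grid and the value $b=1.5$ are arbitrary choices.

```python
def primal_opt(b):
    """min x^2 subject to x >= b."""
    x = max(b, 0.0)
    return x * x


def dual_value(alpha, b):
    """d(alpha) = min_x x^2 - alpha * (x - b), attained at x = alpha / 2."""
    x = alpha / 2.0
    return x * x - alpha * (x - b)


b = 1.5
alphas = [i / 100.0 for i in range(0, 601)]  # grid over alpha >= 0
d_star = max(dual_value(a, b) for a in alphas)
p_star = primal_opt(b)
```

Every `dual_value(a, b)` stays below `p_star`, and the maximum over the grid is reached at $\alpha=2b$ , where the two values coincide.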
For SVM (hard-margin version)
- Primal: $\min_w\frac{\|w\|_2^2}{2}$ s.t. $y_iw^\top x_i\ge 1$ .
- Dual: set the derivative of the Lagrangian with respect to $w$ to zero, substitute back, and maximize over $\alpha\ge 0$ .
  - $L(w,\alpha)=\frac{\|w\|_2^2}{2}-\sum_i\alpha_i (y_iw^\top x_i-1)$
  - $\partial L/\partial w=0\ \Rightarrow\ w=\sum_{i}\alpha_iy_ix_i$
  - $d(\alpha)=\sum_{i}\alpha _i-\frac{1}{2}\sum_i\sum_jy_iy_j\alpha_i\alpha_j\langle x_i,x_j\rangle$

Kernel Method

Replace $\langle x_i,x_j\rangle$ with $\langle \phi(x_i),\phi(x_j)\rangle$ to embed $x$ into a high-dimensional space.
Use $k(x_i,x_j)$ to compute $\langle \phi(x_i),\phi(x_j)\rangle$ without forming $\phi$ explicitly. A common choice is the Gaussian kernel $e^{-\frac{\|x_i-x_j\|^2}{2\sigma^2}}$ .
Mercer’s theorem: a kernel whose kernel matrix is always positive semidefinite has a corresponding embedding $\phi$ .
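A short sketch of the Gaussian kernel matrix, assuming the standard squared-distance form $\exp(-\|x-z\|^2/(2\sigma^2))$ . The three points are arbitrary; the matrix `K` plays the role of the inner products $\langle\phi(x_i),\phi(x_j)\rangle$ in the dual.

```python
import math


def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def quadratic_form(K, c):
    """c^T K c, which is non-negative for every c when K is PSD."""
    n = len(c)
    return sum(c[i] * K[i][j] * c[j] for i in range(n) for j in range(n))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = [[gaussian_kernel(p, q) for q in points] for p in points]
```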

Decision Tree

Boolean functional analysis

$\mathcal{X}_S(x)=\prod_{i\in S}x_i$
$f(x)=\sum_{S}\hat{f_S}\mathcal{X}_S(x)$
$\hat{f_S}=\langle f,\mathcal{X}_S\rangle=\mathbb{E}_{x\sim D}[f(x)\mathcal{X}_S(x)]$
Parseval: $\mathbb{E}_{x\sim D}[f(x)^2]=\sum_{S}\hat{f_S}^2$
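The expansion can be computed exactly for a small function. This sketch assumes inputs in $\{-1,1\}^n$ and takes $D$ to be the uniform distribution (under which the parity functions are orthonormal); the example function, an AND of two bits with values in $\{0,1\}$ , is hypothetical.

```python
from itertools import combinations, product


def chi(S, x):
    """Parity basis function: product of x_i over i in S."""
    out = 1
    for i in S:
        out *= x[i]
    return out


def fourier_coefficients(f, n):
    """Exact coefficients E_x[f(x) chi_S(x)] under the uniform distribution."""
    cube = list(product([-1, 1], repeat=n))
    subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
    return {S: sum(f(x) * chi(S, x) for x in cube) / len(cube) for S in subsets}


def and_of_two_bits(x):
    """1 when both bits are +1, else 0."""
    return 1 if x[0] == 1 and x[1] == 1 else 0


coeffs = fourier_coefficients(and_of_two_bits, 2)
mean_square = sum(
    and_of_two_bits(x) ** 2 for x in product([-1, 1], repeat=2)
) / 4
sum_of_squares = sum(c * c for c in coeffs.values())
```

Here every coefficient equals $1/4$ , matching $f(x)=\frac{(1+x_0)(1+x_1)}{4}$ , and `mean_square` equals `sum_of_squares`.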
A decision tree $T$ with $s$ leaf nodes can be converted into a function of degree $\log(\frac{s}{\epsilon})$ and sparsity $\frac{s^2}{\epsilon}$ that $2\epsilon$ -approximates $T$ .
- Truncate the tree to log depth
- Write it as a combination of AND terms
- Keep the bases $S$ with large $\hat{f_S}$
Estimation
- LMN: $\hat{f_S}=\mathbb{E}_{x\sim D}[f(x)\mathcal{X}_S(x)]\approx\frac{1}{m}\sum_{i=1}^m f(x_i)\mathcal{X}_{S}(x_i)$
  - Does not work well in practice and has no guarantees in the noisy setting.
- Compressed Sensing
  - $y = A x + e$ : $x$ contains the coefficients $\hat{f_S}$ ; $A$ and $y$ come from samples.
  - LASSO finds $x^*$ with bounded error.
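A sketch of the sample-average (LMN-style) estimate of a single coefficient. It assumes uniform samples from $\{-1,1\}^n$ and a noiseless, hypothetical target (the parity of bits 0 and 1); the sample size is arbitrary.

```python
import random

random.seed(0)


def chi(S, x):
    """Parity basis function: product of x_i over i in S."""
    out = 1
    for i in S:
        out *= x[i]
    return out


def estimate_coefficient(f, S, samples):
    """Empirical mean of f(x) * chi_S(x) over the given samples."""
    return sum(f(x) * chi(S, x) for x in samples) / len(samples)


def target(x):
    """Hypothetical target: parity of bits 0 and 1."""
    return x[0] * x[1]


n, m = 4, 4000
samples = [tuple(random.choice([-1, 1]) for _ in range(n)) for _ in range(m)]
est_true = estimate_coefficient(target, (0, 1), samples)  # true value is 1
est_zero = estimate_coefficient(target, (2,), samples)    # true value is 0
```

The estimate for the subset that matches the target is exact, while the other one is only close to zero, with an error that shrinks like $1/\sqrt{m}$ .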

Splitting variables

Greedy: Gini index*

Random forest

Addresses the overfitting problem of a single tree.
Construct $B$ trees. Every tree is trained on $n$ samples drawn from $\{(x_i,y_i)\}_{i=1}^n$ with replacement (repeats allowed), so each element is missing from a given tree's sample with probability $(1-\frac{1}{n})^n\approx\frac{1}{e}$ .
Output the average of the $B$ trees.
Can also bag the features
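The $1/e$ figure is easy to check by simulation. This sketch draws one bootstrap sample of size $n$ and compares the fraction of elements never drawn with $(1-\frac{1}{n})^n$ and with $1/e$ ; the value $n=1000$ is arbitrary.

```python
import math
import random

random.seed(0)
n = 1000
data = list(range(n))

# One bootstrap sample: n draws with replacement.
sample = [random.choice(data) for _ in range(n)]
missed_fraction = 1.0 - len(set(sample)) / n

exact = (1.0 - 1.0 / n) ** n  # probability that a fixed element is never drawn
limit = 1.0 / math.e
```

The elements a tree never sees can serve as a held-out set for that tree.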

Adaboost*

Combine weak learners. Make hard cases more likely to be sampled.
Sample distribution $D_t(i)$ in round $t$ . $\epsilon_t=\Pr_{i\sim D_t}[h_t(x_i)\neq y_i]$ .
$D_1(i)=\frac{1}{m}$ . Learn $h_t$ from $D_t$ .
$D_{t+1}(i)=\begin{cases} \frac{1}{Z_t}D_t(i)e^{-\alpha_t} & y_i=h_t(x_i)\\ \frac{1}{Z_t}D_t(i)e^{\alpha_t} & y_i\neq h_t(x_i) \end{cases}$
$\alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$
$H_{final}=sign(\sum \alpha_t h_t)$
$\epsilon_t=\frac{1}{2}-\gamma_t$ .
- $\gamma_t$ measures how much better this weak learner is than a random classifier.
$error(H_{final})\le \prod _{t}2\sqrt{\epsilon_t(1-\epsilon_t)}= \prod_t\sqrt{1-4\gamma_t^2}\le \exp\left(-2\sum_t\gamma_t^2\right)$
Proof:
- Let $f(x)=\sum_t\alpha_th_t(x)$ . Unrolling the update gives $D_{T+1}(i)=\frac{1}{m}\frac{\exp(-y_if(x_i))}{\prod_tZ_t}$
- $\begin{aligned} error(H_{final})&=\frac{1}{m}\sum_{i=1}^m 1_{y_if(x_i)\le 0}\\ &\le\frac{1}{m}\sum_{i=1}^m \exp(-y_if(x_i))\\ &=\sum_{i=1}^m D_{T+1}(i)\prod_t Z_t\\ &=\prod_t Z_t \end{aligned}$
- $Z_t=\sum_{i}D_t(i)\exp(-\alpha_ty_ih_t(x_i))=(1-\epsilon_t)e^{-\alpha_t}+\epsilon_te^{\alpha_t}$

  To minimize $Z_t$ , set $\alpha_t=\frac{1}{2}\ln(\frac{1-\epsilon_t}{\epsilon_t})$ , which gives $Z_t=2\sqrt{\epsilon_t(1-\epsilon_t)}$
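The update rules can be sketched end to end on a toy problem. The weak learner here is an assumption for illustration: an exhaustive search over threshold stumps, with the weights $D_t$ used directly rather than resampling. The 1-D data set is made up so that no single stump classifies it, but three rounds do.

```python
import math


def stump_predict(stump, x):
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(xs, ys, D):
    """Weak learner: the threshold stump with the smallest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(xs[0])):
        for threshold in sorted({x[feature] for x in xs}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(d for x, y, d in zip(xs, ys, D)
                          if stump_predict(stump, x) != y)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(xs, ys, rounds):
    m = len(xs)
    D = [1.0 / m] * m  # D_1(i) = 1/m
    ensemble = []
    for _ in range(rounds):
        h, eps = best_stump(xs, ys, D)
        if eps == 0.0 or eps >= 0.5:  # perfect or useless weak learner
            if eps == 0.0:
                ensemble.append((1.0, h))
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        # Reweight: e^{-alpha} if correct, e^{+alpha} if wrong, then normalize.
        D = [d * math.exp(-alpha * y * stump_predict(h, x))
             for x, y, d in zip(xs, ys, D)]
        Z = sum(D)
        D = [d / Z for d in D]
        ensemble.append((alpha, h))
    return ensemble


def predict(ensemble, x):
    """H_final(x) = sign(sum_t alpha_t h_t(x))."""
    score = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if score > 0 else -1


xs = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
```

On this data the weighted errors of the three chosen stumps are $1/3$ , $1/4$ and $1/6$ , and the weighted vote classifies all six points correctly.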
Generalization
Drawback: only binary classification.
Extension: gradient boosting handles regression, e.g. minimizing $L=\frac{1}{2}\sum_i(F(x_i)-y_i)^2$
- AdaBoost can be viewed as coordinate descent: round $t$ changes $\alpha_t$ from $0$ to its final value. Only one coordinate changes per round since the dimension (the number of weak learners) is too large.
- Learn a new regression tree $h (x)$ to fit the negative gradient $-\partial L/\partial F(x_i)=y_i-F(x_i)$ for square loss. Then update $F'(x)=F(x)+\eta h(x)$
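A minimal sketch of gradient boosting for square loss. Two assumptions for illustration: depth-1 regression stumps on 1-D inputs stand in for regression trees, and each stump is fitted to the residuals $y_i-F(x_i)$ , which are the negative gradient of the loss. The data and the step size $\eta=0.5$ are made up.

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs: (threshold, left, right)."""
    best, best_sse = None, float("inf")
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - lmean) ** 2 for t in left)
               + sum((t - rmean) ** 2 for t in right))
        if sse < best_sse:
            best, best_sse = (threshold, lmean, rmean), sse
    return best


def stump_value(stump, x):
    threshold, left, right = stump
    return left if x <= threshold else right


def gradient_boost(xs, ys, rounds, eta):
    stumps = []
    F = [0.0] * len(xs)  # current predictions F(x_i)
    for _ in range(rounds):
        residuals = [y - f for y, f in zip(ys, F)]  # negative gradient
        h = fit_stump(xs, residuals)
        stumps.append(h)
        F = [f + eta * stump_value(h, x) for f, x in zip(F, xs)]
    return stumps, F


def loss(F, ys):
    return 0.5 * sum((f - y) ** 2 for f, y in zip(F, ys))


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.0]
_, F_few = gradient_boost(xs, ys, rounds=1, eta=0.5)
_, F_many = gradient_boost(xs, ys, rounds=50, eta=0.5)
```

Because each stump is a least-squares fit to the current residuals, every round with a nonzero stump lowers the training loss when $0<\eta<2$ .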