【Machine Learning】Other Stuff

These notes cover the parts of Tsinghua University's Machine Learning lecture notes not treated earlier; they were essentially written as a cheat sheet a day or two before the exam. The coverage is broad but not detailed, intended mainly as review and memorization material.

Robust Machine Learning

Attack: PGD

  • $\max_{\delta\in\Delta}\mathrm{Loss}(f_\theta(x+\delta),y)$: find the $\delta$ that maximizes the loss.
  • How to compute $\delta$?
    • Projected Gradient Descent: take a gradient step on $\delta$, then project back onto $\Delta$.
    • Fast Gradient Sign Method (FGSM): for example, $\Delta=\{\delta:\|\delta\|_\infty\le\epsilon\}$. As the step size grows we always land on a corner of the ball, so we keep only the sign: $\delta_{t+1}=\eta\cdot\mathrm{sign}(\nabla_x L_\theta(x+\delta_t,y))$.
    • PGD: run the FGSM step multiple times, projecting after each step (a sketch follows).
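
A minimal PyTorch sketch of PGD under the $\ell_\infty$ ball above; the model, loss, and step sizes are placeholders, and input-range clamping is omitted.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, eta=2/255, steps=10):
    """PGD: iterated FGSM steps, projecting back onto the l_inf ball of radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)   # the loss we want to maximize
        loss.backward()
        with torch.no_grad():
            delta += eta * delta.grad.sign()  # FGSM step: keep only the sign
            delta.clamp_(-eps, eps)           # projection onto Delta
        delta.grad.zero_()
    return (x + delta).detach()
```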

Defend: Adversarial Training

  • Danskin’s Theorem:
    $$\frac{\partial}{\partial \theta}\max_\delta L(f_\theta(x+\delta),y)=\frac{\partial}{\partial \theta}L(f_\theta(x+\delta^*),y)$$
    where $\delta^*$ is the inner maximizer. We only need the worst-case attack samples: for each batch, find $\delta^*$ (e.g. with PGD), then do GD on $\theta$ at $x+\delta^*$ (see the sketch after this list).

  • Models become more semantically meaningful through adversarial training.
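
A sketch of the resulting training loop under Danskin's theorem, reusing the `pgd_attack` sketch above; all names are illustrative.

```python
def adversarial_training_step(model, loss_fn, optimizer, x, y):
    """Solve the inner max with PGD, then take the outer gradient step at delta*."""
    x_adv = pgd_attack(model, loss_fn, x, y)  # inner max: find delta* for this batch
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)           # outer min: gradient on theta at delta*
    loss.backward()
    optimizer.step()
    return loss.item()
```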

Robust Feature Learning

  • For a dog photo $x$, attack it to $x'$ so that $f(x')=\text{cat}$. Training on the relabeled pairs $(x',\text{cat})$ still works: it teaches the model the “non-robust” features.
  • For a training image $x$, generate $x_\tau$ from a random initialization such that $g(x)\approx g(x_\tau)$, where $g$ is a (robust) feature extractor. $x_\tau$ is also a good training sample: it gives the model similar robust features.

False Sense of Security

  • Backward Pass Differentiable Approximation (BPDA): run the hard-to-differentiate part as-is on the forward pass, but substitute an easy approximation (e.g. the identity) on the backward pass (a sketch follows this list).
  • For randomized defenses, take multiple samples and average the gradients.
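
A minimal sketch of BPDA in PyTorch; `hard_preprocess` is a hypothetical stand-in for a non-differentiable defense.

```python
import torch

def hard_preprocess(x):
    # hypothetical non-differentiable defense, e.g. bit-depth reduction
    return torch.round(x * 15.0) / 15.0

class BPDA(torch.autograd.Function):
    """Run the defense as-is on the forward pass; approximate its Jacobian by I on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return hard_preprocess(x)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # ignore the gradient that is hard to compute

# usage inside an attack: logits = model(BPDA.apply(x + delta))
```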

Provable Robust Certificates

  • Classification: add random noise around $x$, build a histogram of the predicted classes, and pick the most frequent class nearby (a sketch follows this list).
  • Compare the histograms centered at $x$ and at $x+\delta$:
    • Worst case: greedy filling(*)
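
A sketch of the histogram vote (randomized-smoothing style), assuming Gaussian noise and a single input with batch dimension 1; `sigma` and `n` are illustrative.

```python
import torch

def smoothed_predict(model, x, sigma=0.25, n=1000, num_classes=10):
    """Vote over noisy copies of x and return the most frequent class."""
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n):
            pred = model(x + sigma * torch.randn_like(x)).argmax()
            counts[pred] += 1
    return counts.argmax().item()  # the largest bar in the histogram
```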

Hyperparameter

Bayesian Optimization

  • Estimate the objective as a Gaussian (a Gaussian process over hyperparameters).
  • Explore both high-variance points and points with low predicted loss (see the sketch below).
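
A sketch of one Bayesian-optimization step with scikit-learn's `GaussianProcessRegressor` and a lower-confidence-bound acquisition, assuming we minimize a loss; `kappa` trades exploration against exploitation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_next(X_seen, y_seen, candidates, kappa=2.0):
    """Fit a GP to (hyperparameter, loss) pairs; pick the best LCB candidate.

    X_seen: (n, d) tried hyperparameters; y_seen: their losses;
    candidates: (m, d) points to score.
    """
    gp = GaussianProcessRegressor().fit(X_seen, y_seen)
    mu, sigma = gp.predict(candidates, return_std=True)
    # low mean = exploit the good region; high sigma = explore the uncertain region
    return candidates[np.argmin(mu - kappa * sigma)]
```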

Gradient descent

  • Directly using GD to tune hyperparameters: the naive approach must store every intermediate weight (a memory problem).
  • SGD with momentum: $v_{t+1}=\gamma_t v_t-(1-\gamma_t)\nabla_w L(w_t,\theta,t)$
    • Storing $w_{t+1}$, $v_{t+1}$ and $\nabla_w L(w_t,\theta,t)$, we can recover the whole chain backwards (see the sketch after this list).
  • Requirement: the hyperparameter must be continuous.
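
A sketch of reversing one momentum step. The weight update $w_{t+1}=w_t+\eta\,v_{t+1}$ is an assumption; the notes only give the velocity update.

```python
def reverse_momentum_step(w_next, v_next, grad, gamma, lr):
    """Recover (w_t, v_t) from (w_{t+1}, v_{t+1}) and the stored gradient.

    Assumes the forward updates
        v_{t+1} = gamma * v_t - (1 - gamma) * grad
        w_{t+1} = w_t + lr * v_{t+1}
    so reversing needs only the stored gradient, not the whole weight history.
    """
    w_prev = w_next - lr * v_next
    v_prev = (v_next + (1 - gamma) * grad) / gamma
    return w_prev, v_prev
```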

Random Search

  • Works better than grid search.

Best Arm identification

  • Successive Halving(SH) Algorithm(*)

    • Each round uses budget $B/\log_2(n)$; there are $\log_2(n)$ rounds in total.
    • In each round, every surviving arm gets the same budget, and only half of them survive (see the SH sketch after this list).
  • Assume $v_1>v_2\ge\dots\ge v_n$ and let $\Delta_i=v_1-v_i$. With probability $1-\delta$, the algorithm finds the best arm with $B=O\!\left(\max_{i>1}\frac{i}{\Delta_i^2}\log n\log(\log n/\delta)\right)$, assuming $v_i\in[0,1]$.

    • Proof. Concentration inequality: the probability that $\frac{1}{3}$ of the $\frac{3}{4}N_r$ worse arms beat the best one is bounded; union bound over all rounds.
  • Hyperparameter tuning

    • $B\ge 2\log_2(n)\left(n+\sum_{i=2}^{n}\bar{\gamma}^{-1}\!\left(\frac{v_i-v_1}{2}\right)\right)$
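
A sketch of Successive Halving; `evaluate(config, budget)` is an assumed callback returning an empirical value $\hat v_i$ (higher is better).

```python
import numpy as np

def successive_halving(configs, evaluate, total_budget):
    """Spend total_budget / log2(n) per round; keep the better half each round."""
    survivors = list(configs)
    rounds = max(1, int(np.ceil(np.log2(len(configs)))))
    per_round = total_budget / rounds
    for _ in range(rounds):
        per_arm = per_round / len(survivors)      # equal budget per survivor
        scores = [evaluate(c, per_arm) for c in survivors]
        keep = np.argsort(scores)[::-1][: max(1, len(survivors) // 2)]
        survivors = [survivors[i] for i in keep]
    return survivors[0]
```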

Neural Architecture Search

  • Reinforcement learning: a controller proposes architectures and is rewarded by validation accuracy.
  • ProxylessNAS: each candidate architecture (op) has a weight $\alpha_i$. Sample a binary gate so that exactly one op is active at a time, with probability $p_i$ given by a softmax over $\alpha$. Then $\frac{\partial L}{\partial\alpha_i}\approx\sum_j\frac{\partial L}{\partial g_j}\frac{\partial p_j}{\partial\alpha_i}$, where $g_j\in\{0,1\}$ and $\partial L/\partial g_j$ measures the influence of op $j$ on $L$ (a simplified sketch follows).
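
A simplified straight-through sketch of the binary gate in PyTorch; the real ProxylessNAS estimator differs in details (e.g. it samples two paths), so treat this as illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryGateMixedOp(nn.Module):
    """Only one candidate op is active per forward pass; gradients still reach alpha."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture weights

    def forward(self, x):
        p = F.softmax(self.alpha, dim=0)    # p_i derived from alpha_i
        idx = int(torch.multinomial(p, 1))  # sample a single active path
        hard = torch.zeros_like(p)
        hard[idx] = 1.0
        # straight-through gate: forward value is g_i in {0, 1},
        # backward flows through p_i into alpha
        gate = hard - p.detach() + p
        return gate[idx] * self.ops[idx](x)
```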

Machine Learning Augmented

Differential Privacy

  • Randomness is essential.

    • If we had a non-trivial deterministic algorithm, we could change database $A$ into $B$ row by row; at some step the output changes, so there exists a pair of databases differing in one row with different outputs, which reveals information about that row.
  • Database: a histogram $x\in\mathbb{N}^{|X|}$ over a discrete set $X$ (the categories).

  • $M$ is $(\epsilon,\delta)$-differentially private if for all $x,y\in\mathbb{N}^{|X|}$ with $\|x-y\|_1\le 1$ and all $S\subseteq\mathrm{Range}(M)$:
    $$\Pr[M(x)\in S]\le e^{\epsilon}\Pr[M(y)\in S]+\delta$$

  • Differential privacy is immune to post-processing. Proof:

    • For a deterministic function: proved easily by taking preimages.
    • A random function is a convex combination of deterministic functions, i.e. each deterministic function is chosen with some probability.
  • With an $(\epsilon,0)$-differentially private mechanism $M$, the voting result does not change much when one vote changes:
    $$E_{a\sim f(M(x))}[u_i(a)]=\sum_{a\in A}u_i(a)\Pr_{f(M(x))}[a]\le\sum_{a\in A}u_i(a)e^{\epsilon}\Pr_{f(M(y))}[a]=e^{\epsilon}E_{a\sim f(M(y))}[u_i(a)]$$
    $f$ maps the mechanism’s output to a feature, and we only consider the expected voting utility $u_i(a)$ of the feature $a$.

  • Laplace Mechanism: achieves $(\epsilon,0)$-DP by simply adding Laplace noise to $f(x)$, $x\in\mathbb{N}^{|X|}$ (see the sketch below).

    • Assume $f:\mathbb{N}^{|X|}\to\mathbb{R}^{k}$.

    • Definition: $M_L(x,f,\epsilon)=f(x)+(Y_1,\dots,Y_k)$

      • $Y_i$ are i.i.d. random variables drawn from $\mathrm{Lap}(\frac{\Delta f}{\epsilon})$.
      • The $\ell_1$ sensitivity of $f$ is $\Delta f=\max_{x,y:\|x-y\|_1\le 1}\|f(x)-f(y)\|_1$: how much $f$ can change when one entry changes.
      • Probability density of $\mathrm{Lap}(b)$: $p(z)=\frac{1}{2b}\exp(-\frac{|z|}{b})$, with variance $\sigma^2=2b^2$.
    • Proof. The probability density of $M_L(x,f,\epsilon)$ is $p_x(z)\propto\prod_{i=1}^k\exp\left(-\frac{\epsilon|f(x)_i-z_i|}{\Delta f}\right)$. It is easy to show $p_x(z)/p_y(z)\le\exp(\epsilon)$ via the triangle inequality.

    • Accuracy: for $\delta\in(0,1]$,
      $$\Pr\left[\|f(x)-M_L(x,f,\epsilon)\|_\infty\ge\ln\left(\frac{k}{\delta}\right)\cdot\frac{\Delta f}{\epsilon}\right]\le\delta$$

      • Proved directly from $\Pr[|Y|\ge t\cdot b]=\exp(-t)$ for $Y\sim \mathrm{Lap}(b)$, plus a union bound over the $k$ coordinates.
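
A minimal NumPy sketch of the Laplace mechanism; the counting-query example assumes $\ell_1$ sensitivity 1.

```python
import numpy as np

def laplace_mechanism(f_x, sensitivity, epsilon):
    """Release f(x) with (epsilon, 0)-DP by adding Lap(sensitivity / epsilon) noise."""
    b = sensitivity / epsilon
    return f_x + np.random.laplace(loc=0.0, scale=b, size=np.shape(f_x))

# e.g. a counting query: changing one row changes the count by at most 1
noisy_count = laplace_mechanism(f_x=42.0, sensitivity=1.0, epsilon=0.5)
```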

Big data

  • Idea: data distribution rarely changes.
  • Replace the B-tree with a model. We know err_min and err_max exactly, because we only care about the data already in the database (see the sketch below).
    • Advantages: less storage, faster lookups, more parallelism, hardware accelerators.
    • Use a linear model (fast).
    • Since the prediction is close, finish with exponential search around it.
    • Bloom Filter: “is the key in my set?” => answers either “maybe yes” or “definitely no”.
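
A sketch of a learned index over a sorted NumPy array of keys: a linear model predicts the position, and since the data is fixed we know the worst-case error, so we only search a small window (bounded binary search stands in here for the exponential search mentioned above).

```python
import numpy as np

def fit_learned_index(keys):
    """Fit position ~ slope * key + intercept and record the worst-case error."""
    pos = np.arange(len(keys))
    slope, intercept = np.polyfit(keys, pos, deg=1)
    pred = np.clip(np.round(slope * keys + intercept), 0, len(keys) - 1)
    max_err = int(np.max(np.abs(pred - pos)))  # known exactly: we only serve this data
    return slope, intercept, max_err

def lookup(keys, key, slope, intercept, max_err):
    """Predict a position, then search only inside the known error window."""
    guess = int(np.clip(round(slope * key + intercept), 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], key))
```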

Low Rank Approximation

  • $A\in\mathbb{R}^{n\times m}$, $[A]_k=\arg\min_{\mathrm{rank}(A')\le k}\|A-A'\|_F$
  • Learn a sketch $S\in\mathbb{R}^{p\times n}$, do the low-rank decomposition on the much smaller $SA$, then recover an approximation of $A$ from it (see the sketch below).
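
A NumPy sketch of sketch-and-solve low-rank approximation with a random Gaussian $S$; the lecture's point is that $S$ can instead be learned.

```python
import numpy as np

def sketched_low_rank(A, k, p):
    """Compress A to the small matrix SA, then project A onto SA's top-k row space."""
    n, m = A.shape
    S = np.random.randn(p, n) / np.sqrt(p)             # random sketch; could be learned
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    V_k = Vt[:k]                                       # top-k right singular vectors of SA
    return (A @ V_k.T) @ V_k                           # rank-k approximation of A
```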

Rectified Flow

  • Given empirical observations of distributions $\pi_0$ and $\pi_1$, find a transport map $T$ such that $T(Z)\sim\pi_1$ for $Z\sim\pi_0$.
    • E.g. $\pi_0$ is Gaussian noise and $\pi_1$ is the data distribution (pictures). We want a map from noise to pictures (as in diffusion) with certain desirable features.
    • If $T(X_0)=X_1$, then along the straight-line interpolation $X_t=(1-t)X_0+tX_1$ we want $v(X_t,t)=\frac{dX_t}{dt}=X_1-X_0$. The interpolation is linear, so one cluster maps to another along straight paths.
    • Minimize the loss $\mathbb{E}\left[\|(X_1-X_0)-v(X_t,t)\|^2\right]$
  • Algorithm (a minimal training sketch follows):
    • randomly couple samples from $\pi_0$ and $\pi_1$;
    • minimize the loss over $\theta$ for $v_\theta$;
    • re-couple $\pi_0,\pi_1$ by following $v_\theta$, and repeat the minimization (reflow).
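
A minimal PyTorch sketch of the training objective; `v_net` is an assumed network taking $(x_t,t)$, and the batches `x0`, `x1` come from the current coupling of $\pi_0,\pi_1$.

```python
import torch

def rectified_flow_loss(v_net, x0, x1):
    """Match v(X_t, t) to the straight-line velocity X_1 - X_0 at a random time t."""
    t = torch.rand(x0.shape[0], 1)   # one time per sample, uniform in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # point on the straight path X_t
    target = x1 - x0                 # dX_t/dt along that path
    return ((v_net(x_t, t) - target) ** 2).mean()
```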

RoPE Attention

  • Embed positions into $q_m,k_n$ for attention by rotations in the complex plane (a sketch follows this list).
    • $\beta$-base number system view: different frequency components encode position at different accuracy levels.
  • Use the inner product of $q_m$ and $k_n$ to represent their similarity; with RoPE it depends only on the relative position $m-n$.
  • Softmax attention: $\text{Attention}(Q,K,V)=\text{softmax}(QK^\top)V$
  • Faster computation:
    • linear attention
    • compute the softmax separately
    • design the structure of $QK^\top$
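
A sketch of the rotation applied to a `(seq_len, dim)` tensor of queries or keys, pairing even/odd feature dimensions; with this, $\langle q_m,k_n\rangle$ depends only on $m-n$.

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair by a position-dependent angle."""
    seq_len, dim = x.shape                  # dim must be even
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)           # position m
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # theta_i
    angles = pos * freqs                    # rotation angle m * theta_i
    cos, sin = torch.cos(angles), torch.sin(angles)
    out = torch.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out
```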
