【Machine Learning】Other Stuff

These notes cover the parts of Tsinghua University's Machine Learning lecture notes not treated earlier; they were essentially written as a cheat sheet a day or two before the exam. The coverage is broad but not detailed, intended mainly as review and memorization material.

Robust Machine Learning

Attack: PGD

  • $\max_{\delta\in\Delta}\mathrm{Loss}(f_\theta(x+\delta),y)$: find the $\delta$ that maximizes the loss.
  • How to compute $\delta$?
    • Projected Gradient Descent: take a gradient step on $\delta$, then project back onto $\Delta$.
    • Fast Gradient Sign Method (FGSM): for example, $\Delta=\{\delta:\|\delta\|_\infty\le\epsilon\}$. As the step size grows we always land on a corner of the ball, so we keep only the sign: $\delta_{t+1}=\eta\cdot\mathrm{sign}(\nabla_x L_\theta(x+\delta_t,y))$.
    • PGD: run the FGSM step multiple times, projecting after each step (a sketch follows).
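
A minimal PyTorch sketch of PGD under the $\ell_\infty$ ball above; the model, loss, and step sizes are placeholders, and input-range clamping is omitted.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, eta=2/255, steps=10):
    """PGD: iterated FGSM steps, projecting back onto the l_inf ball of radius eps."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(model(x + delta), y)   # the loss we want to maximize
        loss.backward()
        with torch.no_grad():
            delta += eta * delta.grad.sign()  # FGSM step: keep only the sign
            delta.clamp_(-eps, eps)           # projection onto Delta
        delta.grad.zero_()
    return (x + delta).detach()
```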

Defend: Adversarial Training

  • Danskin’s Theorem:
    $$\frac{\partial}{\partial \theta}\max_\delta L(f_\theta(x+\delta),y)=\frac{\partial}{\partial \theta}L(f_\theta(x+\delta^*),y)$$
    where $\delta^*$ is the inner maximizer. We only need the worst-case attack samples: for each batch, find $\delta^*$ (e.g. with PGD), then do GD on $\theta$ at $x+\delta^*$ (see the sketch after this list).

  • Models become more semantically meaningful through adversarial training.
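
A sketch of the resulting training loop under Danskin's theorem, reusing the `pgd_attack` sketch above; all names are illustrative.

```python
def adversarial_training_step(model, loss_fn, optimizer, x, y):
    """Solve the inner max with PGD, then take the outer gradient step at delta*."""
    x_adv = pgd_attack(model, loss_fn, x, y)  # inner max: find delta* for this batch
    optimizer.zero_grad()
    loss = loss_fn(model(x_adv), y)           # outer min: gradient on theta at delta*
    loss.backward()
    optimizer.step()
    return loss.item()
```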

Robust Feature Learning

  • For a dog photo $x$, attack it to $x'$ so that $f(x')=\text{cat}$. Training on the relabeled pairs $(x',\text{cat})$ still works: it teaches the model the “non-robust” features.
  • For a training image $x$, generate $x_\tau$ from a random initialization such that $g(x)\approx g(x_\tau)$, where $g$ is a (robust) feature extractor. $x_\tau$ is also a good training sample: it gives the model similar robust features.

False Sense of Security

  • Backward Pass Differentiable Approximation (BPDA): run the hard-to-differentiate part as-is on the forward pass, but substitute an easy approximation (e.g. the identity) on the backward pass (a sketch follows this list).
  • For randomized defenses, take multiple samples and average the gradients.
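
A minimal sketch of BPDA in PyTorch; `hard_preprocess` is a hypothetical stand-in for a non-differentiable defense.

```python
import torch

def hard_preprocess(x):
    # hypothetical non-differentiable defense, e.g. bit-depth reduction
    return torch.round(x * 15.0) / 15.0

class BPDA(torch.autograd.Function):
    """Run the defense as-is on the forward pass; approximate its Jacobian by I on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return hard_preprocess(x)
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # ignore the gradient that is hard to compute

# usage inside an attack: logits = model(BPDA.apply(x + delta))
```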

Provable Robust Certificates

  • Classification: add random noise around $x$, build a histogram of the predicted classes, and pick the most frequent class nearby (a sketch follows this list).
  • Compare the histograms centered at $x$ and at $x+\delta$:
    • Worst case: greedy filling(*)
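
A sketch of the histogram vote (randomized-smoothing style), assuming Gaussian noise and a single input with batch dimension 1; `sigma` and `n` are illustrative.

```python
import torch

def smoothed_predict(model, x, sigma=0.25, n=1000, num_classes=10):
    """Vote over noisy copies of x and return the most frequent class."""
    counts = torch.zeros(num_classes, dtype=torch.long)
    with torch.no_grad():
        for _ in range(n):
            pred = model(x + sigma * torch.randn_like(x)).argmax()
            counts[pred] += 1
    return counts.argmax().item()  # the largest bar in the histogram
```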

Hyperparameter

Bayesian Optimization

  • Estimate the objective as a Gaussian (a Gaussian process over hyperparameters).
  • Explore both high-variance points and points with low predicted loss (see the sketch below).
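
A sketch of one Bayesian-optimization step with scikit-learn's `GaussianProcessRegressor` and a lower-confidence-bound acquisition, assuming we minimize a loss; `kappa` trades exploration against exploitation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_next(X_seen, y_seen, candidates, kappa=2.0):
    """Fit a GP to (hyperparameter, loss) pairs; pick the best LCB candidate.

    X_seen: (n, d) tried hyperparameters; y_seen: their losses;
    candidates: (m, d) points to score.
    """
    gp = GaussianProcessRegressor().fit(X_seen, y_seen)
    mu, sigma = gp.predict(candidates, return_std=True)
    # low mean = exploit the good region; high sigma = explore the uncertain region
    return candidates[np.argmin(mu - kappa * sigma)]
```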

Gradient descent

  • Directly using GD to tune hyperparameters: the naive approach must store every intermediate weight (a memory problem).
  • SGD with momentum: $v_{t+1}=\gamma_t v_t-(1-\gamma_t)\nabla_w L(w_t,\theta,t)$
    • Storing $w_{t+1}$, $v_{t+1}$ and $\nabla_w L(w_t,\theta,t)$, we can recover the whole chain backwards (see the sketch after this list).
  • Requirement: the hyperparameter must be continuous.
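
A sketch of reversing one momentum step. The weight update $w_{t+1}=w_t+\eta\,v_{t+1}$ is an assumption; the notes only give the velocity update.

```python
def reverse_momentum_step(w_next, v_next, grad, gamma, lr):
    """Recover (w_t, v_t) from (w_{t+1}, v_{t+1}) and the stored gradient.

    Assumes the forward updates
        v_{t+1} = gamma * v_t - (1 - gamma) * grad
        w_{t+1} = w_t + lr * v_{t+1}
    so reversing needs only the stored gradient, not the whole weight history.
    """
    w_prev = w_next - lr * v_next
    v_prev = (v_next + (1 - gamma) * grad) / gamma
    return w_prev, v_prev
```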

Random Search

  • Works better than grid search.

Best Arm identification

  • Successive Halving(SH) Algorithm(*)

    • Each round uses budget $B/\log_2(n)$; there are $\log_2(n)$ rounds in total.
    • In each round, every surviving arm gets the same budget, and only half of them survive (see the SH sketch after this list).
  • Assume $v_1>v_2\ge\dots\ge v_n$ and let $\Delta_i=v_1-v_i$. With probability $1-\delta$, the algorithm finds the best arm with $B=O\!\left(\max_{i>1}\frac{i}{\Delta_i^2}\log n\log(\log n/\delta)\right)$, assuming $v_i\in[0,1]$.

    • Proof. Concentration inequality: the probability that $\frac{1}{3}$ of the $\frac{3}{4}N_r$ worse arms beat the best one is bounded; union bound over all rounds.
  • Hyperparameter tuning

    • $B\ge 2\log_2(n)\left(n+\sum_{i=2}^{n}\bar{\gamma}^{-1}\!\left(\frac{v_i-v_1}{2}\right)\right)$
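
A sketch of Successive Halving; `evaluate(config, budget)` is an assumed callback returning an empirical value $\hat v_i$ (higher is better).

```python
import numpy as np

def successive_halving(configs, evaluate, total_budget):
    """Spend total_budget / log2(n) per round; keep the better half each round."""
    survivors = list(configs)
    rounds = max(1, int(np.ceil(np.log2(len(configs)))))
    per_round = total_budget / rounds
    for _ in range(rounds):
        per_arm = per_round / len(survivors)      # equal budget per survivor
        scores = [evaluate(c, per_arm) for c in survivors]
        keep = np.argsort(scores)[::-1][: max(1, len(survivors) // 2)]
        survivors = [survivors[i] for i in keep]
    return survivors[0]
```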

Neural Architecture Search

  • Reinforcement learning: a controller proposes architectures and is rewarded by validation accuracy.
  • ProxylessNAS: each candidate architecture (op) has a weight $\alpha_i$. Sample a binary gate so that exactly one op is active at a time, with probability $p_i$ given by a softmax over $\alpha$. Then $\frac{\partial L}{\partial\alpha_i}\approx\sum_j\frac{\partial L}{\partial g_j}\frac{\partial p_j}{\partial\alpha_i}$, where $g_j\in\{0,1\}$ and $\partial L/\partial g_j$ measures the influence of op $j$ on $L$ (a simplified sketch follows).
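
A simplified straight-through sketch of the binary gate in PyTorch; the real ProxylessNAS estimator differs in details (e.g. it samples two paths), so treat this as illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryGateMixedOp(nn.Module):
    """Only one candidate op is active per forward pass; gradients still reach alpha."""
    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        self.alpha = nn.Parameter(torch.zeros(len(ops)))  # architecture weights

    def forward(self, x):
        p = F.softmax(self.alpha, dim=0)    # p_i derived from alpha_i
        idx = int(torch.multinomial(p, 1))  # sample a single active path
        hard = torch.zeros_like(p)
        hard[idx] = 1.0
        # straight-through gate: forward value is g_i in {0, 1},
        # backward flows through p_i into alpha
        gate = hard - p.detach() + p
        return gate[idx] * self.ops[idx](x)
```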

Machine Learning Augmented

Differential Privacy

  • Randomness is essential.

    • If we had a non-trivial deterministic algorithm, we could change database $A$ into $B$ row by row; at some step the output changes, so there exists a pair of databases differing in one row with different outputs, which reveals information about that row.
  • Database: a histogram $x\in\mathbb{N}^{|X|}$ over a discrete set $X$ (the categories).

  • $M$ is $(\epsilon,\delta)$-differentially private if for all $x,y\in\mathbb{N}^{|X|}$ with $\|x-y\|_1\le 1$ and all $S\subseteq\mathrm{Range}(M)$:
    $$\Pr[M(x)\in S]\le e^{\epsilon}\Pr[M(y)\in S]+\delta$$

  • Differential privacy is immune to post-processing. Proof:

    • For a deterministic function: proved easily by taking preimages.
    • A random function is a convex combination of deterministic functions, i.e. each deterministic function is chosen with some probability.
  • With an $(\epsilon,0)$-differentially private mechanism $M$, the voting result does not change much when one vote changes:
    $$E_{a\sim f(M(x))}[u_i(a)]=\sum_{a\in A}u_i(a)\Pr_{f(M(x))}[a]\le\sum_{a\in A}u_i(a)e^{\epsilon}\Pr_{f(M(y))}[a]=e^{\epsilon}E_{a\sim f(M(y))}[u_i(a)]$$
    $f$ maps the mechanism’s output to a feature, and we only consider the expected voting utility $u_i(a)$ of the feature $a$.

  • Laplace Mechanism: achieves $(\epsilon,0)$-DP by simply adding Laplace noise to $f(x)$, $x\in\mathbb{N}^{|X|}$ (see the sketch below).

    • Assume $f:\mathbb{N}^{|X|}\to\mathbb{R}^{k}$.

    • Definition: $M_L(x,f,\epsilon)=f(x)+(Y_1,\dots,Y_k)$

      • $Y_i$ are i.i.d. random variables drawn from $\mathrm{Lap}(\frac{\Delta f}{\epsilon})$.
      • The $\ell_1$ sensitivity of $f$ is $\Delta f=\max_{x,y:\|x-y\|_1\le 1}\|f(x)-f(y)\|_1$: how much $f$ can change when one entry changes.
      • Probability density of $\mathrm{Lap}(b)$: $p(z)=\frac{1}{2b}\exp(-\frac{|z|}{b})$, with variance $\sigma^2=2b^2$.
    • Proof. The probability density of $M_L(x,f,\epsilon)$ is $p_x(z)\propto\prod_{i=1}^k\exp\left(-\frac{\epsilon|f(x)_i-z_i|}{\Delta f}\right)$. It is easy to show $p_x(z)/p_y(z)\le\exp(\epsilon)$ via the triangle inequality.

    • Accuracy: for $\delta\in(0,1]$,
      $$\Pr\left[\|f(x)-M_L(x,f,\epsilon)\|_\infty\ge\ln\left(\frac{k}{\delta}\right)\cdot\frac{\Delta f}{\epsilon}\right]\le\delta$$

      • Proved directly from $\Pr[|Y|\ge t\cdot b]=\exp(-t)$ for $Y\sim \mathrm{Lap}(b)$, plus a union bound over the $k$ coordinates.
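
A minimal NumPy sketch of the Laplace mechanism; the counting-query example assumes $\ell_1$ sensitivity 1.

```python
import numpy as np

def laplace_mechanism(f_x, sensitivity, epsilon):
    """Release f(x) with (epsilon, 0)-DP by adding Lap(sensitivity / epsilon) noise."""
    b = sensitivity / epsilon
    return f_x + np.random.laplace(loc=0.0, scale=b, size=np.shape(f_x))

# e.g. a counting query: changing one row changes the count by at most 1
noisy_count = laplace_mechanism(f_x=42.0, sensitivity=1.0, epsilon=0.5)
```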

Big data

  • Idea: data distribution rarely changes.
  • Replace the B-tree with a model. We know err_min and err_max exactly, because we only care about the data already in the database (see the sketch below).
    • Advantages: less storage, faster lookups, more parallelism, hardware accelerators.
    • Use a linear model (fast).
    • Since the prediction is close, finish with exponential search around it.
    • Bloom Filter: “is the key in my set?” => answers either “maybe yes” or “definitely no”.
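
A sketch of a learned index over a sorted NumPy array of keys: a linear model predicts the position, and since the data is fixed we know the worst-case error, so we only search a small window (bounded binary search stands in here for the exponential search mentioned above).

```python
import numpy as np

def fit_learned_index(keys):
    """Fit position ~ slope * key + intercept and record the worst-case error."""
    pos = np.arange(len(keys))
    slope, intercept = np.polyfit(keys, pos, deg=1)
    pred = np.clip(np.round(slope * keys + intercept), 0, len(keys) - 1)
    max_err = int(np.max(np.abs(pred - pos)))  # known exactly: we only serve this data
    return slope, intercept, max_err

def lookup(keys, key, slope, intercept, max_err):
    """Predict a position, then search only inside the known error window."""
    guess = int(np.clip(round(slope * key + intercept), 0, len(keys) - 1))
    lo, hi = max(0, guess - max_err), min(len(keys), guess + max_err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], key))
```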

Low Rank Approximation

  • $A\in\mathbb{R}^{n\times m}$, $[A]_k=\arg\min_{\mathrm{rank}(A')\le k}\|A-A'\|_F$
  • Learn a sketch $S\in\mathbb{R}^{p\times n}$, do the low-rank decomposition on the much smaller $SA$, then recover an approximation of $A$ from it (see the sketch below).
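
A NumPy sketch of sketch-and-solve low-rank approximation with a random Gaussian $S$; the lecture's point is that $S$ can instead be learned.

```python
import numpy as np

def sketched_low_rank(A, k, p):
    """Compress A to the small matrix SA, then project A onto SA's top-k row space."""
    n, m = A.shape
    S = np.random.randn(p, n) / np.sqrt(p)             # random sketch; could be learned
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    V_k = Vt[:k]                                       # top-k right singular vectors of SA
    return (A @ V_k.T) @ V_k                           # rank-k approximation of A
```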

Rectified Flow

  • Given empirical observations of distributions $\pi_0$ and $\pi_1$, find a transport map $T$ such that $T(Z)\sim\pi_1$ for $Z\sim\pi_0$.
    • E.g. $\pi_0$ is Gaussian noise and $\pi_1$ is the data distribution (pictures). We want a map from noise to pictures (as in diffusion) with certain desirable features.
    • If $T(X_0)=X_1$, then along the straight-line interpolation $X_t=(1-t)X_0+tX_1$ we want $v(X_t,t)=\frac{dX_t}{dt}=X_1-X_0$. The interpolation is linear, so one cluster maps to another along straight paths.
    • Minimize the loss $\mathbb{E}\left[\|(X_1-X_0)-v(X_t,t)\|^2\right]$
  • Algorithm (a minimal training sketch follows):
    • randomly couple samples from $\pi_0$ and $\pi_1$;
    • minimize the loss over $\theta$ for $v_\theta$;
    • re-couple $\pi_0,\pi_1$ by following $v_\theta$, and repeat the minimization (reflow).
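
A minimal PyTorch sketch of the training objective; `v_net` is an assumed network taking $(x_t,t)$, and the batches `x0`, `x1` come from the current coupling of $\pi_0,\pi_1$.

```python
import torch

def rectified_flow_loss(v_net, x0, x1):
    """Match v(X_t, t) to the straight-line velocity X_1 - X_0 at a random time t."""
    t = torch.rand(x0.shape[0], 1)   # one time per sample, uniform in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # point on the straight path X_t
    target = x1 - x0                 # dX_t/dt along that path
    return ((v_net(x_t, t) - target) ** 2).mean()
```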

RoPE Attention

  • Embed positions into $q_m,k_n$ for attention by rotations in the complex plane (a sketch follows this list).
    • $\beta$-base number system view: different frequency components encode position at different accuracy levels.
  • Use the inner product of $q_m$ and $k_n$ to represent their similarity; with RoPE it depends only on the relative position $m-n$.
  • Softmax attention: $\text{Attention}(Q,K,V)=\text{softmax}(QK^\top)V$
  • Faster computation:
    • linear attention
    • compute the softmax separately
    • design the structure of $QK^\top$
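
A sketch of the rotation applied to a `(seq_len, dim)` tensor of queries or keys, pairing even/odd feature dimensions; with this, $\langle q_m,k_n\rangle$ depends only on $m-n$.

```python
import torch

def apply_rope(x, base=10000.0):
    """Rotate each (even, odd) feature pair by a position-dependent angle."""
    seq_len, dim = x.shape                  # dim must be even
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)           # position m
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)   # theta_i
    angles = pos * freqs                    # rotation angle m * theta_i
    cos, sin = torch.cos(angles), torch.sin(angles)
    out = torch.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out
```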
