CUDA Implementation and Optimization of Self-Attention (Part 1)


Overview

Self-attention is the most critical and most complex part of the Transformer, and it is the core target of Transformer optimization. Understanding self-attention is essential for understanding the Transformer in depth. This post focuses on self-attention; because the topic is fairly involved, it is split into two parts, of which this is the first.

01

A Basic CUDA Implementation of Self-Attention

The principle of self-attention is well known and has been analyzed at length in previous posts, so it is not repeated here; this post only walks through the code.

1. CPU Version

Below is the basic CPU implementation, followed by a brief analysis:

•  The input inp is the QKV tensor obtained by multiplying x with QKV_weight. For batch index b (batch size B), position t (sequence length T), and head h, the q (query_t) values live at inp[b, t, h*hs:(h+1)*hs]; the k (key_t2) values are offset by C from that, i.e. inp[b, t, h*hs+C:(h+1)*hs+C].

•  With q and k in hand, each attention score is computed as a dot product and scaled right after it is computed (the running maximum is tracked at the same time for the later softmax); once a full row has been computed, the causal mask is applied.

•  A softmax is then applied to obtain the attention weights.

•  Finally, the v (value_t2) values are indexed and multiplied with the attention weights to accumulate the output.

// CPU code reference
void attention_forward_cpu(float* out, float* preatt, float* att,
                           const float* inp,
                           int B, int T, int C, int NH) {
    // input is (B, T, 3C) Q,K,V
    // preatt, att are (B, NH, T, T)
    // output is (B, T, C)
    int C3 = C*3;
    int hs = C / NH; // head size
    float scale = 1.0 / sqrtf(hs);

    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            for (int h = 0; h < NH; h++) {
                const float* query_t = inp + b * T * C3 + t * C3 + h * hs;
                float* preatt_bth = preatt + b*NH*T*T + h*T*T + t*T;
                float* att_bth = att + b*NH*T*T + h*T*T + t*T;

                // pass 1: calculate query dot key and maxval
                float maxval = -10000.0f; // TODO something better
                for (int t2 = 0; t2 <= t; t2++) {
                    const float* key_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C; // +C because it's key

                    // (query_t) dot (key_t2)
                    float val = 0.0f;
                    for (int i = 0; i < hs; i++) {
                        val += query_t[i] * key_t2[i];
                    }
                    val *= scale;
                    if (val > maxval) {
                        maxval = val;
                    }

                    preatt_bth[t2] = val;
                }
                // pad with -INFINITY outside of autoregressive region for debugging comparisons
                for (int t2 = t+1; t2 < T; t2++) {
                    preatt_bth[t2] = -INFINITY;
                }

                // pass 2: calculate the exp and keep track of sum
                float expsum = 0.0f;
                for (int t2 = 0; t2 <= t; t2++) {
                    float expv = expf(preatt_bth[t2] - maxval);
                    expsum += expv;
                    att_bth[t2] = expv;
                }
                float expsum_inv = expsum == 0.0f ? 0.0f : 1.0f / expsum;

                // pass 3: normalize to get the softmax
                for (int t2 = 0; t2 < T; t2++) {
                    if (t2 <= t) {
                        att_bth[t2] *= expsum_inv;
                    } else {
                        // causal attention mask. not strictly necessary to set to zero here
                        // only doing this explicitly for debugging and checking to PyTorch
                        att_bth[t2] = 0.0f;
                    }
                }

                // pass 4: accumulate weighted values into the output of attention
                float* out_bth = out + b * T * C + t * C + h * hs;
                for (int i = 0; i < hs; i++) { out_bth[i] = 0.0f; }
                for (int t2 = 0; t2 <= t; t2++) {
                    const float* value_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C*2; // +C*2 because it's value
                    float att_btht2 = att_bth[t2];
                    for (int i = 0; i < hs; i++) {
                        out_bth[i] += att_btht2 * value_t2[i];
                    }
                }
            }
        }
    }
}
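
As a quick usage sketch (hypothetical, not from the original post), the reference can be exercised on a toy problem by allocating the four buffers with the shapes listed in the comments above:

// hypothetical toy-sized call of the CPU reference (requires <stdlib.h>)
int B = 1, T = 4, C = 8, NH = 2;                              // head size hs = C / NH = 4
float* inp    = calloc((size_t)B * T * 3 * C, sizeof(float)); // packed QKV, shape (B, T, 3C)
float* preatt = calloc((size_t)B * NH * T * T, sizeof(float));
float* att    = calloc((size_t)B * NH * T * T, sizeof(float));
float* out    = calloc((size_t)B * T * C, sizeof(float));
attention_forward_cpu(out, preatt, att, inp, B, T, C, NH);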

2. Initial CUDA Implementation (V1)

This version keeps the basic approach of the CPU code; only the computation is organized differently, split into three kernels:

•  Step 1: compute the attention scores using B*NH*T*T threads in total, i.e. one score per thread.

// attention calculation
int total_threads = B * NH * T * T;
int num_blocks = ceil_div(total_threads, block_size);
attention_query_key_kernel1<<<num_blocks, block_size>>>(preatt, inp, B, T, C, NH);
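
The launch above relies on a ceil_div helper that the snippet does not show; a typical definition (an assumption here, matching common CUDA utility code) is:

// assumed helper: number of blocks needed to cover a elements with b threads each
int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}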

The kernel itself is implemented as follows:

__global__ void attention_query_key_kernel1(float* preatt, const float* inp,
                                            int B, int T, int C, int NH) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = B * NH * T * T;

    if (idx < total_threads) {
        int t2 = idx % T;
        int t = (idx / T) % T;
        if (t2 > t) {
            // autoregressive mask
            preatt[idx] = -INFINITY;
            return;
        }
        int h = (idx / (T * T)) % NH;
        int b = idx / (NH * T * T);

        int C3 = C*3;
        int hs = C / NH; // head size
        const float* query_t = inp + b * T * C3 + t * C3 + h * hs;
        const float* key_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C; // +C because it's key

        // (query_t) dot (key_t2)
        float val = 0.0f;
        for (int i = 0; i < hs; i++) {
            val += query_t[i] * key_t2[i];
        }
        val *= 1.0 / sqrtf(hs);

        preatt[idx] = val;
    }
}

•  Step 2: the softmax. This operation was discussed in detail in earlier posts on operator optimization, so it is not revisited here.

__global__ void attention_softmax_kernel1(float* att, const float* preatt,
                                          int B, int T, int NH) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = B * T * NH;

    if (idx < total_threads) {
        int h = idx % NH;
        int t = (idx / NH) % T;
        int b = idx / (NH * T);

        const float* preatt_bth = preatt + b*NH*T*T + h*T*T + t*T;
        float* att_bth = att + b*NH*T*T + h*T*T + t*T;

        // find maxval
        float maxval = -10000.0f; // TODO something better
        for (int t2 = 0; t2 <= t; t2++) {
            if (preatt_bth[t2] > maxval) {
                maxval = preatt_bth[t2];
            }
        }

        // calculate the exp and keep track of sum
        float expsum = 0.0f;
        for (int t2 = 0; t2 <= t; t2++) {
            float expv = expf(preatt_bth[t2] - maxval);
            expsum += expv;
            att_bth[t2] = expv;
        }
        float expsum_inv = expsum == 0.0f ? 0.0f : 1.0f / expsum;

        // normalize to get the softmax
        for (int t2 = 0; t2 < T; t2++) {
            if (t2 <= t) {
                att_bth[t2] *= expsum_inv;
            } else {
                // causal attention mask. not strictly necessary to set to zero here
                // only doing this explicitly for debugging and checking to PyTorch
                att_bth[t2] = 0.0f;
            }
        }
    }
}

•  Step 3: multiply the attention weights with v.

__global__ void attention_value_kernel1(float* out, const float* att, const float* inp,
                                        int B, int T, int C, int NH) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int total_threads = B * T * NH;

    if (idx < total_threads) {
        int h = idx % NH;
        int t = (idx / NH) % T;
        int b = idx / (NH * T);

        int C3 = C*3;
        int hs = C / NH; // head size

        float* out_bth = out + b * T * C + t * C + h * hs;
        const float* att_bth = att + b*NH*T*T + h*T*T + t*T;

        for (int i = 0; i < hs; i++) { out_bth[i] = 0.0f; }
        for (int t2 = 0; t2 <= t; t2++) {
            const float* value_t2 = inp + b * T * C3 + t2 * C3 + h * hs + C*2; // +C*2 because it's value
            float att_btht2 = att_bth[t2];
            for (int i = 0; i < hs; i++) {
                out_bth[i] += att_btht2 * value_t2[i];
            }
        }
    }
}
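
Putting the three launches together, a sketch of the full V1 forward pass might look like the following (an assumption about the host code, reusing the ceil_div helper above; buffer allocation omitted). Note that kernel 1 uses one thread per score, while kernels 2 and 3 use one thread per (b, t, h) row:

// hypothetical orchestration of the three V1 kernels
void attention_forward1(float* out, float* preatt, float* att, const float* inp,
                        int B, int T, int C, int NH, int block_size) {
    int total_threads = B * NH * T * T;                  // one thread per (b, h, t, t2) score
    int num_blocks = ceil_div(total_threads, block_size);
    attention_query_key_kernel1<<<num_blocks, block_size>>>(preatt, inp, B, T, C, NH);

    total_threads = B * T * NH;                          // one thread per (b, t, h) row
    num_blocks = ceil_div(total_threads, block_size);
    attention_softmax_kernel1<<<num_blocks, block_size>>>(att, preatt, B, T, NH);
    attention_value_kernel1<<<num_blocks, block_size>>>(out, att, inp, B, T, C, NH);
}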

This completes the most basic self-attention implementation. The performance numbers are:

block_size   32 | time 238.912872 ms
block_size   64 | time 252.689301 ms
block_size  128 | time 246.945175 ms
block_size  256 | time 261.469421 ms
block_size  512 | time 241.190613 ms

3. A Simple Flash Attention Implementation (V2)

Flash attention is an extremely important optimization of self-attention built around the GPU memory hierarchy.
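
The core idea (summarized here for orientation; the notation matches the kernel below) is to stream K and V through shared memory in tiles and, for every output row, keep a running row maximum m and a running softmax denominator l. When a new tile contributes a local maximum row_m and local sum row_l, they are merged as row_m_new = max(row_m_prev, row_m) and row_l_new = exp(row_m_prev - row_m_new) * row_l_prev + exp(row_m - row_m_new) * row_l, and the partial output O is rescaled by the same exponential factors before the tile's P @ V contribution is added. This way the full T x T attention matrix never has to be materialized in global memory.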

•  First, the key parameters are initialized.

// these are hardcoded to 32 for now
const int Bc = 32;
const int Br = 32;
// renaming these to be consistent with the kernel
// const int B = B;
const int nh = NH;
const int N = T;
const int d = C / NH;
// more
const int Tc = ceil((float) N / Bc);
const int Tr = ceil((float) N / Br);
const float softmax_scale = 1.0 / sqrt(d);

•  Then the SRAM (shared memory) needed per block is computed, to make sure it does not exceed the hardware limit.

// calculate SRAM size needed per block, ensure we have enough shared memory
int col_tile_size = Bc * d;  // size of Kj, Vj
int row_tile_size = Br * d;  // size of Qi
const int sram_size =
    (2 * col_tile_size * sizeof(float))  // SRAM size for Kj, Vj
    + (row_tile_size * sizeof(float))    // SRAM size for Qi
    + (Bc * Br * sizeof(float));         // SRAM size for S
int max_sram_size;
cudaDeviceGetAttribute(&max_sram_size, cudaDevAttrMaxSharedMemoryPerBlock, 0);

if (sram_size > max_sram_size) {
    printf("Max shared memory: %d, requested shared memory: %d \n", max_sram_size, sram_size);
    printf("SRAM size exceeds maximum shared memory per block\n");
    printf("Try decreasing col_tile_size or row_tile_size further\n");
    exit(1);
}

•  To avoid complicated indexing, reshape, and permute operations inside the flash attention kernel itself, a dedicated kernel first rearranges the QKV tensor.

__global__ void permute_kernel(float* q, float* k, float* v,
                               const float* inp,
                               int B, int N, int NH, int d) {
    // okay so now, this kernel wants Q,K,V to all be of shape (B, NH, N, d)
    // but instead, we have a single tensor QKV (inp) of shape (B, N, 3, NH, d)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Q[b][nh_][n][d_] = inp[b][n][0][nh_][d_]
    if (idx < B * NH * N * d) {
        int b = idx / (NH * N * d);
        int rest = idx % (NH * N * d);
        int nh_ = rest / (N * d);
        rest = rest % (N * d);
        int n = rest / d;
        int d_ = rest % d;

        int inp_idx = \
            (b * N * 3 * NH * d)
            +   (n * 3 * NH * d)
            +       (0 * NH * d)
            +          (nh_ * d)
            +                d_;

        q[idx] = inp[inp_idx];
        k[idx] = inp[inp_idx + NH * d];
        v[idx] = inp[inp_idx + 2 * (NH * d)];
    }
}
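
A plausible launch for this kernel (a sketch, not shown in the post: one thread per element of the (B, NH, N, d) output layout, reusing the ceil_div helper from earlier):

int total_threads = B * NH * N * d;   // equals B * T * C
int num_blocks = ceil_div(total_threads, block_size);
permute_kernel<<<num_blocks, block_size>>>(q, k, v, inp, B, N, NH, d);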

•  Next comes the core step, the flash attention kernel itself; its procedure can be traced through the tiled loop structure below (the original post illustrates it with a diagram).

__global__ void attention_forward_kernel2(
    const float* Q,
    const float* K,
    const float* V,
    const int N,
    const int d,
    const int Tc,
    const int Tr,
    const int Bc,
    const int Br,
    const float softmax_scale,
    float* l,
    float* m,
    float* O
) {
    int tx = threadIdx.x;
    int bx = blockIdx.x; int by = blockIdx.y;  // batch and head index

    // Offset into Q,K,V,O,l,m - different for each batch and head
    int qkv_offset = (bx * gridDim.y * N * d) + (by * N * d);  // gridDim.y = nh
    int lm_offset = (bx * gridDim.y * N) + (by * N);  // offset for l and m

    // Define SRAM for Q,K,V,S
    extern __shared__ float sram[];
    int tile_size = Bc * d;  // size of Qi, Kj, Vj
    float* Qi = sram;
    float* Kj = &sram[tile_size];
    float* Vj = &sram[tile_size * 2];
    float* S = &sram[tile_size * 3];

    for (int j = 0; j < Tc; j++) {

        // Load Kj, Vj to SRAM
        for (int x = 0; x < d; x++) {
            Kj[(tx * d) + x] = K[qkv_offset + (tile_size * j) + (tx * d) + x];
            Vj[(tx * d) + x] = V[qkv_offset + (tile_size * j) + (tx * d) + x];
        }
        __syncthreads();  // such that the inner loop can use the correct Kj, Vj

        for (int i = 0; i < Tr; i++)  {
            // if past the end of the sequence, break
            if (i * Br + tx >= N) {
                break;
            }

            // Load Qi to SRAM, l and m to registers
            for (int x = 0; x < d; x++) {
                Qi[(tx * d) + x] = Q[qkv_offset + (tile_size * i) + (tx * d) + x];
            }
            float row_m_prev = m[lm_offset + (Br * i) + tx];
            float row_l_prev = l[lm_offset + (Br * i) + tx];

            // S = QK^T, row_m = rowmax(S)
            // S[tx][y] = Sum_{x = 0}^{d-1} {Qi[tx][x] * Kj[y][x]}
            // row_m = Max_{y = 0}^{Bc-1} S[tx][y]
            // with causal masking
            float row_m = -INFINITY;
            for (int y = 0; y < Bc; y++) {
                if (j * Bc + y >= N) {
                    break;
                }
                float sum = 0;
                for (int x = 0; x < d; x++) {
                    sum += Qi[(tx * d) + x] * Kj[(y * d) + x];
                }
                sum *= softmax_scale;
                if (i * Br + tx < j * Bc + y)
                    sum = -INFINITY;
                S[(Bc * tx) + y] = sum;

                if (sum > row_m)
                    row_m = sum;
            }

            // implement softmax with causal masking
            // P = exp(S - row_m), row_l = rowsum(P)
            // P[tx][y] = exp(S[tx][y] - row_m)
            float row_l = 0;
            for (int y = 0; y < Bc; y++) {
                if (j * Bc + y >= N) {
                    break;
                }
                if (i * Br + tx < j * Bc + y)
                    S[(Bc * tx) + y] = 0;
                else
                    S[(Bc * tx) + y] = __expf(S[(Bc * tx) + y] - row_m);
                row_l += S[(Bc * tx) + y];
            }

            // Compute new m and l
            float row_m_new = max(row_m_prev, row_m);
            float row_l_new = (__expf(row_m_prev - row_m_new) * row_l_prev) + (__expf(row_m - row_m_new) * row_l);

            // Write O, l, m to HBM
            for (int x = 0; x < d; x++) {
                float pv = 0;  // Pij * Vj
                for (int y = 0; y < Bc; y++) {
                    if (j * Bc + y >= N) {
                        break;
                    }
                    pv += S[(Bc * tx) + y] * Vj[(y * d) + x];
                }
                O[qkv_offset + (tile_size * i) + (tx * d) + x] = (1 / row_l_new) \
                    * ((row_l_prev * __expf(row_m_prev - row_m_new) * O[qkv_offset + (tile_size * i) + (tx * d) + x]) \
                    + (__expf(row_m - row_m_new) * pv));
            }
            m[lm_offset + (Br * i) + tx] = row_m_new;
            l[lm_offset + (Br * i) + tx] = row_l_new;
        }
        __syncthreads();  // otherwise, thread can use the wrong Kj, Vj in inner loop
    }
}
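
The kernel is launched with one thread block per (batch, head) pair, Bc (= Br = 32) threads per block, and sram_size bytes of dynamic shared memory. A sketch of the launch follows (an approximation of the host code, which the post does not show in full; it assumes l has been zero-filled and m filled with a large negative value such as -10000 beforehand, and that o is a (B, nh, N, d) buffer that is unpermuted into out afterwards):

// sketch of the V2 launch configuration
dim3 grid_dim(B, nh);   // one block per (batch, head); gridDim.y = nh as the kernel expects
dim3 block_dim(Bc);     // Bc = Br = 32 threads, one per row of the Q tile
attention_forward_kernel2<<<grid_dim, block_dim, sram_size>>>(
    q, k, v, N, d, Tc, Tr, Bc, Br, softmax_scale, l, m, o);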

•  After the computation above, an unpermute step is still needed to bring the result back to the (B, N, nh, d) layout:

__global__ void unpermute_kernel(const float* inp, float *out, int B, int N, int NH, int d) {
    // out has shape (B, nh, N, d) but we need to unpermute it to (B, N, nh, d)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // out[b][n][nh_][d_] <- inp[b][nh_][n][d_]
    if (idx < B * NH * N * d) {
        int b = idx / (NH * N * d);
        int rest = idx % (NH * N * d);
        int nh_ = rest / (N * d);
        rest = rest % (N * d);
        int n = rest / d;
        int d_ = rest % d;

        int other_idx = (b * NH * N * d) + (n * NH * d) + (nh_ * d) + d_;
        out[other_idx] = inp[idx];
    }
}

This completes a simple forward pass of flash attention 1. Its performance is actually worse than V1, mainly because the problem size here is small. The numbers are:

block_size   32 | time 536.709961 ms
block_size   64 | time 526.100098 ms
block_size  128 | time 583.016235 ms
block_size  256 | time 573.955994 ms
block_size  512 | time 534.477051 ms

02

Efficient Implementations of Self-Attention

1. Using cuBLAS Library Routines (V3)

In the implementations so far, every operation was written by hand. Although the results are correct, performance still lags far behind library code, so in this section the matrix multiplications inside self-attention are handed over to the official cuBLAS routines.

Only the two matrix multiplications are shown here. First, q @ k.T:

// batched matrix multiply with cuBLAS
const float alpha = 1.0f;
const float beta = 0.0f;
cublasCheck(cublasSgemmStridedBatched(cublas_handle,
                                      CUBLAS_OP_T, CUBLAS_OP_N,
                                      T, T, HS,
                                      &alpha,
                                      k, HS, T * HS,
                                      q, HS, T * HS,
                                      &beta,
                                      preatt, T, T * T,
                                      B * NH));
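
A note on the argument order (my reading of the call; the original post does not spell this out): cuBLAS assumes column-major storage, so a row-major (T, HS) matrix is, from cuBLAS's point of view, an (HS, T) column-major matrix. Passing k as the first operand with CUBLAS_OP_T and q as the second with CUBLAS_OP_N therefore yields, when the column-major result is read back row-major, exactly q @ k^T for each head; HS is the leading dimension of q and k, while T * HS and T * T are the per-head strides that let this single call cover all B * NH heads.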

Then att @ v:

// new approach: first cuBLAS another batched matmul
// y = att @ v # (B, nh, T, T) @ (B, nh, T, hs) -> (B, nh, T, hs)
cublasCheck(cublasSgemmStridedBatched(cublas_handle,
                                      CUBLAS_OP_N, CUBLAS_OP_N,
                                      HS, T, T,
                                      &alpha,
                                      v, HS, T * HS,
                                      att, T, T * T,
                                      &beta,
                                      vaccum, HS, T * HS,
                                      B * NH));

Compared with V1, this is an improvement of roughly two orders of magnitude:

block_size   32 | time 4.318913 ms
block_size   64 | time 2.606850 ms
block_size  128 | time 2.034935 ms
block_size  256 | time 2.031407 ms
block_size  512 | time 2.064406 ms

2. Operator Fusion and Online Softmax (V4)

Building on V3, this version uses an online softmax and fuses the scale operation into it:

__global__ void softmax_forward_kernel5(float* out, float inv_temperature, const float* inp, int N, int T) {
    // inp, out shape: (N, T, T), where N = B * NH
    // fuses the multiplication by scale inside attention
    // directly autoregressive, so we only compute the lower triangular part
    // uses the online softmax algorithm
    assert(T % 4  == 0);
    namespace cg = cooperative_groups;
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int idx = blockIdx.x * warp.meta_group_size() + warp.meta_group_rank();
    if(idx >= N * T) {
        return;
    }
    int own_pos = idx % T;
    int pos_by_4 = own_pos / 4;

    // one row of inp, i.e. inp[idx, :] of shape (T,)
    const float* x = inp + idx * T;

    // not INF, so we don't get NaNs accidentally when subtracting two values.
    float maxval = -FLT_MAX;
    float sumval = 0.0f;

    const float4* x_vec = reinterpret_cast<const float4*>(x);
    for (int i = warp.thread_rank(); i < pos_by_4; i += warp.size()) {
        float4 v = x_vec[i];
        float old_maxval = maxval;
        for(int k = 0; k < 4; ++k) {
            maxval = fmaxf(maxval, vec_at(v, k));
        }
        sumval *= expf(inv_temperature * (old_maxval - maxval));
        for(int k = 0; k < 4; ++k) {
            sumval += expf(inv_temperature * (vec_at(v, k) - maxval));
        }
    }

    if(4*pos_by_4 + warp.thread_rank() <= own_pos) {
        float old_maxval = maxval;
        maxval = fmaxf(maxval, x[4*pos_by_4 + warp.thread_rank()]);
        sumval *= expf(inv_temperature * (old_maxval - maxval));
        sumval += expf(inv_temperature * (x[4*pos_by_4 + warp.thread_rank()] - maxval));
    }

    float global_maxval = cg::reduce(warp, maxval, cg::greater<float>{});
    sumval *= expf(inv_temperature * (maxval - global_maxval));

    float sum = cg::reduce(warp, sumval, cg::plus<float>{});
    float norm = 1.f / sum;

    // divide the whole row by the sum
    for (int i = warp.thread_rank(); i <= own_pos; i += warp.size()) {
        // recalculation is faster than doing the round-trip through memory.
        float ev = expf(inv_temperature * (__ldcs(x + i) - global_maxval));
        __stcs(out + idx * T + i, ev * norm);
    }
}
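
The kernel uses a small vec_at helper to index into a float4, which is not shown in the snippet. A plausible definition (an assumption here, consistent with how it is used) is:

// assumed helper: treat a float4 as an array of 4 floats and return element index
__device__ inline float vec_at(const float4& vec, int index) {
    return reinterpret_cast<const float*>(&vec)[index];
}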

The remaining operations are unchanged; performance improves a bit further:

block_size   32 | time 1.198167 ms
block_size   64 | time 1.073088 ms
block_size  128 | time 1.042434 ms
block_size  256 | time 1.041798 ms
block_size  512 | time 1.044009 ms

3. FP16 Matrix Math (V5)

The FP32 <-> FP16 type conversions are performed in the permute/unpermute stages, as follows:

if (!skip_permute || first_run_validation) {
    permute_kernel_lowp<<<num_blocks, block_size>>>(q, k, v, inp, B, T, NH, HS);
}
...
if (!skip_permute || first_run_validation) {
    unpermute_kernel_lowp<<<num_blocks, block_size>>>(vaccum, out, B, T, NH, HS);
}
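
The lowp kernels themselves are not reproduced in the post. A minimal sketch of what permute_kernel_lowp could look like, assuming the low-precision buffers are plain __half (the real implementation may use a configurable reduced-precision type), is shown below; unpermute_kernel_lowp would perform the reverse conversion with __half2float. The batched matmuls can then consume the half-precision buffers directly (e.g. through cublasGemmStridedBatchedEx), which is where the additional speedup over V4 comes from.

#include <cuda_fp16.h>

// sketch only: same indexing as permute_kernel above, but converts FP32 -> FP16 on the fly
__global__ void permute_kernel_lowp(__half* q, __half* k, __half* v,
                                    const float* inp,
                                    int B, int N, int NH, int d) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < B * NH * N * d) {
        int b = idx / (NH * N * d);
        int rest = idx % (NH * N * d);
        int nh_ = rest / (N * d);
        rest = rest % (N * d);
        int n = rest / d;
        int d_ = rest % d;

        int inp_idx = (b * N * 3 * NH * d) + (n * 3 * NH * d) + (nh_ * d) + d_;
        q[idx] = __float2half(inp[inp_idx]);
        k[idx] = __float2half(inp[inp_idx + NH * d]);
        v[idx] = __float2half(inp[inp_idx + 2 * (NH * d)]);
    }
}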

The performance numbers are:

block_size   32 | time 0.866851 ms
block_size   64 | time 0.743674 ms
block_size  128 | time 0.703196 ms
block_size  256 | time 0.713902 ms
block_size  512 | time 0.712848 ms

The versions above are compared in a chart (not reproduced here); note that its time axis is logarithmic and the benchmark device is an A100-80G.
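
For reference, the best time from each block-size sweep above:

V1  naive 3-kernel version          238.91 ms
V2  minimal flash attention 1       526.10 ms
V3  cuBLAS batched matmuls            2.03 ms
V4  + fused scale / online softmax    1.04 ms
V5  + FP16 permute and matmuls        0.70 ms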
