引子

下面给出两个极其相似的代码，运行出的时间却是有很大差别：
代码一

#include <stdio.h>
#include <pthread.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>const uint32_t MAX_THREADS = 16;void* ThreadFunc(void* pArg)
{for (int i = 0; i < 1000000000; ++i) // 10亿次累加操作{++*(uint64_t*)pArg;}return NULL;
}int main() {static uint64_t aulArr[MAX_THREADS * 8];pthread_t aulThreadID[MAX_THREADS];auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_create(&aulThreadID[i], nullptr, ThreadFunc, &aulArr[i]));}for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_join(aulThreadID[i], nullptr));}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

耗时： 26396ms

代码二

#include <stdio.h>
#include <pthread.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>const uint32_t MAX_THREADS = 16;void* ThreadFunc(void* pArg)
{for (int i = 0; i < 1000000000; ++i) // 10亿次累加操作{++*(uint64_t*)pArg;}return NULL;
}int main() {static uint64_t aulArr[MAX_THREADS * 8];pthread_t aulThreadID[MAX_THREADS];auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_create(&aulThreadID[i], nullptr, ThreadFunc, &aulArr[i * 8]));}for (int i = 0; i < MAX_THREADS; ++i){assert(0 == pthread_join(aulThreadID[i], nullptr));}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

耗时： 6762ms

这两者的主要差别就在于pthread_create传入的一个是aulArr[i]一个是aulArr[i * 8]

CPU Cache对于并发的影响

cpu cache在做数据同步的时候，有个最小的单位：cache line，当前主流CPU为64字节。
多个CPU读写相同的Cache line的时候需要做一致性同步，多CPU访问相同的Cache Line地址，数据会被反复写脏，频繁进行一致性同步。当多CPU访问不同的Cache Line地址时，无需一致性同步。
在上面的程序中：
static uint64_t aulArr[MAX_THREADS * 8];
占用的数据长度为：8byte * 8 * 16；
8byte * 8=64byte
程序一，每个线程在当前CPU读取数据时,访问的是同一块cache line
程序二，每个线程在当前CPU读取数据时，访问的是不同块的cache line,避免了对一个流水线的反复擦写，效率直线提升。
在这里插入图片描述

读写顺序对性能的影响

CPU会有一个预读，顺带着将需要的块儿旁边的块儿一起读出来放到cache中。所以当我们顺序读的时候就不需要从内存里面读了，可以直接在缓存里面读。
顺序读

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>
#include "string.h"int main() {const uint32_t BLOCK_SIZE = 8 << 20;// 64字节地址对齐，保证每一块正好是一个CacheLinestatic char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));assert((uint64_t)memory % 64 == 0);memset(memory, 0x3c, sizeof(memory));int n = 10;auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());while (n--){char result = 0;for (int i = 0; i < BLOCK_SIZE; ++i){for (int j = 0; j < 64; ++j){result ^= memory[i][j];}}}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

乱序读

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>
#include "string.h"int main() {const uint32_t BLOCK_SIZE = 8 << 20;// 64字节地址对齐，保证每一块正好是一个CacheLinestatic char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));assert((uint64_t)memory % 64 == 0);memset(memory, 0x3c, sizeof(memory));int n = 10;auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());while (n--){char result = 0;for (int i = 0; i < BLOCK_SIZE; ++i){int k = i * 5183 % BLOCK_SIZE;  // 人为打乱顺序for (int j = 0; j < 64; ++j){result ^= memory[k][j];}}}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

顺序读耗时13547ms，随机乱序读耗时21395ms。
如果一定要随机读的话该怎么优化呢？
如果我们知道我们下一轮读取的数据，并且不是要立即访问这个地址的话，使用_mm_prefetch指令优化，告诉CPU提前预读下一轮循环的cacheline
有关该指令可以参考官方文档：https://docs.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2010/84szxsww(v=vs.100)
使用该命令后，再看看运行时间：

#include <stdio.h>
#include <stdint.h>
#include <assert.h>
#include<chrono>
#include "string.h"
#include "xmmintrin.h"int main() {const uint32_t BLOCK_SIZE = 8 << 20;// 64字节地址对齐，保证每一块正好是一个CacheLinestatic char memory[BLOCK_SIZE][64] __attribute__((aligned(64)));assert((uint64_t)memory % 64 == 0);memset(memory, 0x3c, sizeof(memory));int n = 10;auto begin = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());while (n--){char result = 0;for (int i = 0; i < BLOCK_SIZE; ++i){int next_k = (i + 1) * 5183 % BLOCK_SIZE;_mm_prefetch(&memory[next_k][0], _MM_HINT_T0);int k = i * 5183 % BLOCK_SIZE;  // 人为打乱顺序for (int j = 0; j < 64; ++j){result ^= memory[k][j];}}}auto end = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now().time_since_epoch());printf("%lld",end.count() - begin.count());
}

从原来的21395ms优化到15291ms

字节对齐对Cache的影响

在2GB内存，int64为单元进行26亿次异或。分别测试地址对齐与非对齐在顺序访问和随机访问下的耗时

	非地址对齐	地址对齐	耗时比
顺序访问	7.8s	7.7s	1.01:1
随机访问	90s	80s	1.125:1

在顺序访问时，Cache命中率高，且CPU预读，此时差别不大。
在随机访问的情况下，Cache命中率几乎为0，有1/8概率横跨2个cacheline，此时需读两次内存，此时耗时比大概为：7 / 8 * 1 + 1 / 8 * 2 = 1.125
结论就是：
1、cacheline 内部访问非字节对齐变量差别不大
2、跨cacheline访问代价主要为额外的内存读取开销
所以除了网络协议以外，避免出现1字节对齐的情况。可以通过调整成员顺序，减少内存开销。