Why is processing a sorted array faster than an unsorted array?

This is a highly upvoted question I came across while browsing Stack Overflow: Why is processing a sorted array faster than an unsorted array? I think it is a very good example for explaining branch prediction, so I am sharing it here.

1. The Problem

Consider this code:

#include <algorithm>
#include <cstdlib>   // std::rand
#include <ctime>
#include <iostream>
#include <stdint.h>

int main() {
    const uint32_t arraySize = 20000;
    uint32_t data[arraySize];

    for (uint32_t i = 0; i < arraySize; ++i) {
        data[i] = std::rand() % 256;
    }

    // !!! With this, the next loop runs faster
    std::sort(data, data + arraySize);

    clock_t start = clock();
    uint64_t sum = 0;
    for (uint32_t cnt = 0; cnt < 100000; ++cnt) {
        for (uint32_t i = 0; i < arraySize; ++i) {
            if (data[i] > 128) {
                sum += data[i];
            }
        }
    }

    double processTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;
    std::cout << "processTime: " << processTime << std::endl;
    std::cout << "sum: " << sum << std::endl;
    return 0;
}

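Note: the random number generator is deliberately left unseeded so that the pseudo-random numbers in the data array are the same on every run. This prepares for the comparison below and keeps the number of variables in the experiment as small as possible.

We compile and run this code (with gcc 4.1.2; newer versions optimize the effect away):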

$ g++ a.cpp -o a -O3
$ ./a
processTime: 1.78
sum: 191444000000

Now comment out the following line, then compile and run again:

std::sort(data, data + arraySize);

$ g++ a.cpp -o b -O3
$ ./b
processTime: 10.06
sum: 191444000000

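Notice that? With the sort line removed, the whole computation takes more than five times as long (10.06 s versus 1.78 s)!

2. Is It Caused by Cache Misses?

Clearly not. The cache miss rate does not depend on whether the array is sorted: both versions read the data in the same order, the amount of data is the same, the data layout is the same, and both run on the same machine. So we can be sure that the difference has nothing to do with cache misses.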

To verify this analysis, we can use the cachegrind tool that ships with valgrind to measure the cache miss rate:

$ valgrind --tool=cachegrind ./a
==26548== Cachegrind, a cache and branch-prediction profiler
==26548== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==26548== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==26548== Command: ./a
==26548==
--26548-- warning: L3 cache found, using its data for the LL simulation.
--26548-- warning: specified LL cache: line_size 64  assoc 20  total_size 15,728,640
--26548-- warning: simulated LL cache: line_size 64  assoc 30  total_size 15,728,640
processTime: 68.57
sum: 191444000000
==26548==
==26548== I   refs:      14,000,637,620
==26548== I1  misses:             1,327
==26548== LLi misses:             1,293
==26548== I1  miss rate:           0.00%
==26548== LLi miss rate:           0.00%
==26548==
==26548== D   refs:       2,001,434,596  (2,000,993,511 rd   + 441,085 wr)
==26548== D1  misses:       125,115,133  (  125,112,303 rd   +   2,830 wr)
==26548== LLd misses:             7,085  (        4,770 rd   +   2,315 wr)
==26548== D1  miss rate:            6.3% (          6.3%     +     0.6%  )
==26548== LLd miss rate:            0.0% (          0.0%     +     0.5%  )
==26548==
==26548== LL refs:          125,116,460  (  125,113,630 rd   +   2,830 wr)
==26548== LL misses:              8,378  (        6,063 rd   +   2,315 wr)
==26548== LL miss rate:             0.0% (          0.0%     +     0.5%  )

$ valgrind --tool=cachegrind ./b
==13898== Cachegrind, a cache and branch-prediction profiler
==13898== Copyright (C) 2002-2015, and GNU GPL'd, by Nicholas Nethercote et al.
==13898== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==13898== Command: ./b
==13898==
--13898-- warning: L3 cache found, using its data for the LL simulation.
--13898-- warning: specified LL cache: line_size 64  assoc 20  total_size 15,728,640
--13898-- warning: simulated LL cache: line_size 64  assoc 30  total_size 15,728,640
processTime: 76.7
sum: 191444000000
==13898==
==13898== I   refs:      13,998,930,559
==13898== I1  misses:             1,316
==13898== LLi misses:             1,281
==13898== I1  miss rate:           0.00%
==13898== LLi miss rate:           0.00%
==13898==
==13898== D   refs:       2,000,938,800  (2,000,663,898 rd   + 274,902 wr)
==13898== D1  misses:       125,010,958  (  125,008,167 rd   +   2,791 wr)
==13898== LLd misses:             7,083  (        4,768 rd   +   2,315 wr)
==13898== D1  miss rate:            6.2% (          6.2%     +     1.0%  )
==13898== LLd miss rate:            0.0% (          0.0%     +     0.8%  )
==13898==
==13898== LL refs:          125,012,274  (  125,009,483 rd   +   2,791 wr)
==13898== LL misses:              8,364  (        6,049 rd   +   2,315 wr)
==13898== LL miss rate:             0.0% (          0.0%     +     0.8%  )

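Comparing the two runs, their cache miss rates and cache miss counts are almost identical, so the slowdown indeed has nothing to do with cache misses.

3. Branch Prediction

The callgrind tool that ships with valgrind can report the branch misprediction rate: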

$ valgrind --tool=callgrind --branch-sim=yes ./a
==29373== Callgrind, a call-graph generating cache profiler
==29373== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==29373== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==29373== Command: ./a
==29373==
==29373== For interactive control, run 'callgrind_control -h'.
processTime: 288.68
sum: 191444000000
==29373==
==29373== Events    : Ir Bc Bcm Bi Bim
==29373== Collected : 14000637633 4000864744 293254 23654 395
==29373==
==29373== I   refs:      14,000,637,633
==29373==
==29373== Branches:       4,000,888,398  (4,000,864,744 cond + 23,654 ind)
==29373== Mispredicts:          293,649  (      293,254 cond +    395 ind)
==29373== Mispred rate:             0.0% (          0.0%     +    1.7%   )

As you can see, when the array is sorted before computing sum, the branch misprediction rate is very low, practically zero.

$ valgrind --tool=callgrind --branch-sim=yes ./b
==23202== Callgrind, a call-graph generating cache profiler
==23202== Copyright (C) 2002-2015, and GNU GPL'd, by Josef Weidendorfer et al.
==23202== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==23202== Command: ./b
==23202==
==23202== For interactive control, run 'callgrind_control -h'.
processTime: 287.12
sum: 191444000000
==23202==
==23202== Events    : Ir Bc Bcm Bi Bim
==23202== Collected : 13998930783 4000477534 1003409950 23654 395
==23202==
==23202== I   refs:      13,998,930,783
==23202==
==23202== Branches:       4,000,501,188  (4,000,477,534 cond + 23,654 ind)
==23202== Mispredicts:    1,003,410,345  (1,003,409,950 cond +    395 ind)
==23202== Mispred rate:            25.1% (         25.1%     +    1.7%   )

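The unsorted version is a different story: the branch misprediction rate reaches 25%. The inner loop body runs 2,000,000,000 times, so the roughly 1.0 billion mispredicted conditional branches presumably correspond to about half of all evaluations of data[i] > 128, which is what one would expect from a branch on random data. We can therefore conclude that the two versions differ in running time because they differ in CPU branch misprediction rate.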