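Suppose you have an array of structs with 100 million elements, and every element must be initialized to the same value. If there is no ready-made syntax for this kind of initialization and you have to write a for loop, will it be very slow?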
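If the struct members are all simple built-in types and the whole struct is only a few dozen bytes, then even with 100 million elements a for-loop assignment takes on the order of 10^8 nanoseconds, roughly 0.1 seconds. With compiler optimization and cache hits, that is already very fast, so there is not much to worry about. The compiler may well apply optimizations such as loop unrolling for you, so hand-optimizing the code is likely to have little effect. (The code below uses 1e8 + 5, i.e. 100,000,005 elements; the extra 5 are there to check later that the multithreaded version handles the remainder correctly.)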
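#include <chrono>
#include <iostream>
#include <new>
using namespace std;

struct Test
{
    char a;
    int b;
    float c;
};

int main()
{
    // steady_clock measures elapsed wall-clock time on every platform.
    auto t1 = chrono::steady_clock::now();
    const int data_num = 1e8 + 5;
    // new (nothrow) returns a null pointer on failure instead of throwing.
    Test* array = new (nothrow) Test[data_num];
    if (array == nullptr)
    {
        cout << "memory alloc error" << endl;
        return -1;
    }
    array[0].a = 'a';
    array[0].b = 123;
    array[0].c = 123.45f;
    // Copy the first element into every other slot.
    for (int i = 1; i < data_num; i++)
    {
        array[i] = array[0];
    }
    // Print the last element as a sanity check.
    cout << array[data_num - 1].a << endl;
    cout << array[data_num - 1].b << endl;
    cout << array[data_num - 1].c << endl;
    auto t2 = chrono::steady_clock::now();
    cout << chrono::duration_cast<chrono::milliseconds>(t2 - t1).count() << " ms" << endl;
    delete[] array;
    return 0;
}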
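As a side note on the "no ready-made syntax" premise: the C++ standard library does provide ways to fill a range with one value, namely the `std::vector` fill constructor and `std::fill_n`. The sketch below is only an illustration; `make_filled` and `fill_array` are hypothetical helper names, not part of the original code, and whether they run faster than the hand-written loop was not measured here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Test
{
    char a;
    int b;
    float c;
};

// Hypothetical helper: build a vector holding n copies of value
// using the std::vector fill constructor.
std::vector<Test> make_filled(std::size_t n, const Test& value)
{
    return std::vector<Test>(n, value);
}

// Hypothetical helper: fill an existing raw array of n elements with value.
void fill_array(Test* array, std::size_t n, const Test& value)
{
    assert(array != nullptr || n == 0);
    std::fill_n(array, n, value);
}
```

For example, `make_filled(100000005, Test{'a', 123, 123.45f})` would produce the same contents as the loop above, with the memory released automatically when the vector goes out of scope.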
Next, use multiple threads, with each thread performing a doubling memcpy (copying 1, 2, 4, 8, ... elements in successive passes).
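#include <chrono>
#include <cstring>
#include <iostream>
#include <new>
#include <thread>
using namespace std;

struct Test
{
    char a;
    int b;
    float c;
};

const int data_num = 1e8 + 5;
const int thread_num = 4;

// Each thread fills its own chunk of data_num / thread_num elements.
// The chunk's first element is the seed; the filled prefix doubles on each pass.
void memcpy_thread(Test* array, int tid)
{
    const int chunk = data_num / thread_num;
    int base_pos = chunk * tid;
    int cur_pos = 1;
    while (cur_pos * 2 <= chunk)
    {
        // Source [0, cur_pos) and destination [cur_pos, 2 * cur_pos) never overlap.
        memcpy(array + base_pos + cur_pos, array + base_pos, sizeof(Test) * cur_pos);
        cur_pos *= 2;
    }
    // Copy the remaining chunk - cur_pos elements (fewer than cur_pos, so still no overlap).
    memcpy(array + base_pos + cur_pos, array + base_pos, sizeof(Test) * (chunk - cur_pos));
}

int main()
{
    // steady_clock measures elapsed wall-clock time, not CPU time summed over threads.
    auto t1 = chrono::steady_clock::now();
    // new (nothrow) returns a null pointer on failure instead of throwing.
    Test* array = new (nothrow) Test[data_num];
    if (array == nullptr)
    {
        cout << "memory alloc error" << endl;
        return -1;
    }
    // Seed the first element of each thread's chunk.
    for (int i = 0; i < thread_num; i++)
    {
        int index = data_num / thread_num * i;
        array[index].a = 'a';
        array[index].b = 123;
        array[index].c = 123.45f;
    }
    thread td[thread_num];
    for (int i = 0; i < thread_num; i++)
    {
        td[i] = thread(&memcpy_thread, array, i);
    }
    for (int i = 0; i < thread_num; i++)
    {
        td[i].join();
    }
    // Fill the leftover elements that do not divide evenly among the threads.
    for (int i = data_num / thread_num * thread_num; i < data_num; i++)
    {
        array[i].a = 'a';
        array[i].b = 123;
        array[i].c = 123.45f;
    }
    // Print the last element as a sanity check.
    cout << array[data_num - 1].a << endl;
    cout << array[data_num - 1].b << endl;
    cout << array[data_num - 1].c << endl;
    auto t2 = chrono::steady_clock::now();
    cout << chrono::duration_cast<chrono::milliseconds>(t2 - t1).count() << " ms" << endl;
    delete[] array;
    return 0;
}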
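The doubling copy is easier to check in isolation, without threads. The following is a minimal single-threaded sketch of the same idea; `fill_by_doubling` is a hypothetical name introduced here, and it assumes the struct is trivially copyable (as `Test` is) and that `arr[0]` already holds the seed value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

struct Test
{
    char a;
    int b;
    float c;
};

// Hypothetical helper: fill arr[1..n) with copies of arr[0].
// The already-filled prefix doubles on every pass: 1, 2, 4, 8, ...
void fill_by_doubling(Test* arr, std::size_t n)
{
    assert(arr != nullptr || n == 0);
    if (n < 2)
    {
        return;  // nothing to copy
    }
    std::size_t filled = 1;  // arr[0] is the seed
    while (filled * 2 <= n)
    {
        // Source [0, filled) and destination [filled, 2 * filled) do not overlap.
        std::memcpy(arr + filled, arr, sizeof(Test) * filled);
        filled *= 2;
    }
    if (filled < n)
    {
        // Tail: n - filled is smaller than filled, so the ranges still do not overlap.
        std::memcpy(arr + filled, arr, sizeof(Test) * (n - filled));
    }
}
```

Running it on a small array whose size is not a power of two (13, say) exercises both the doubling loop and the tail copy.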
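With 100,000,005 elements (the extra 5 are there to check that the multithreaded version is correct), multithreading plus doubling memcpy takes about 70 ms. The plain for loop took about 150 ms, so neither multithreading nor doubling memcpy gives a dramatic speedup. Beyond 3 threads there is essentially no further gain. My guess is that the hardware's effective parallelism is limited (this may not be the precise way to put it): perhaps contention of some kind, or perhaps limited bus width?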
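One rough way to put these timings in context is to convert them into an effective write rate. The sketch below does only that arithmetic; `effective_gb_per_s` is a hypothetical helper, and the 12-byte element size is an assumption (typical for this struct when `int` and `float` are 4-byte aligned, but not guaranteed by the language). Note that the measured times also include allocation and printing, so the real copy rate is somewhat higher than this estimate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: bytes written per second, in GB/s (1 GB = 1e9 bytes),
// given the element count, the size of one element, and the elapsed milliseconds.
double effective_gb_per_s(std::size_t elements, std::size_t element_size, double milliseconds)
{
    assert(milliseconds > 0.0);
    double bytes = static_cast<double>(elements) * static_cast<double>(element_size);
    double seconds = milliseconds / 1000.0;
    return bytes / seconds / 1e9;
}
```

Under the 12-byte assumption, 100,000,005 elements is about 1.2 GB, so 70 ms works out to roughly 17 GB/s and 150 ms to roughly 8 GB/s. Rates of that size are at least consistent with the memory system, rather than the number of cores, being the limiting factor, although this estimate alone does not prove it.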