Benchmarking CPU and GPU compute and OpenCV resize performance on the RK3588

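  • I. Background
  • II. Summary
  • III. Related links
  • IV. Procedure
    • 1. Environment setup
      • A. Install dependencies
      • B. Set the GPU to performance mode
      • C. Get GPU information
      • D. Get CPU information
    • 2. Query GPU information through the OpenCL API
    • 3. Matrix multiplication with the OpenCL API
    • 4. Measure GPU performance with clpeak
    • 5. Measure CPU compute with OpenBLAS
    • 6. Benchmark OpenCV resize on the CPU and with OpenCL
      • A. Build OpenCV with OpenCL support
      • B. Run the OpenCV test program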

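I. Background

  • The goal is to compare CPU and Mali GPU performance on the RK3588.
  • Mali GPU compute is measured with clpeak.
  • CPU FP32 performance is measured with OpenBLAS (NEON optimizations enabled).
  • OpenCV resize is timed on the CPU and with OpenCL for each interpolation method: a 32x32 image is enlarged to 8192x8192 and shrunk back to 32x32, repeated 100 times.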

II. Summary

  • GPU model: Mali-LODX r0p0 (gpuinfo: Mali-G610 4 cores r0p0 0xA867)
  • GPU FP32 (clpeak): 441.95 GFLOPS
  • CPU FP32 (OpenBLAS + NEON): 53.68 GFLOPS
  • Interpolation: INTER_NEAREST   CPU time (s): 3.01526   GPU time (s): 0.0672681
  • Interpolation: INTER_LINEAR    CPU time (s): 5.3227    GPU time (s): 0.0189366
  • Interpolation: INTER_CUBIC     CPU time (s): 8.22734   GPU time (s): 11.6337
  • Interpolation: INTER_AREA      CPU time (s): 20.4999   GPU time (s): 27.3197
  • Interpolation: INTER_LANCZOS4  CPU time (s): 29.3602   GPU time (s): 43.9484
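
The CPU-to-GPU ratios implied by the list above can be computed directly. The sketch below is only an illustration: the struct, array, and function names are hypothetical, and the numbers are copied from the summary rather than measured again.

```c
#include <stdio.h>

/* One row of the summary: total seconds for 100 enlarge/shrink rounds. */
typedef struct {
    const char *method;
    double cpu_seconds;
    double gpu_seconds;
} ResizeTiming;

/* Timings copied from the summary list (hypothetical array name). */
static const ResizeTiming kTimings[] = {
    {"INTER_NEAREST",  3.01526, 0.0672681},
    {"INTER_LINEAR",   5.3227,  0.0189366},
    {"INTER_CUBIC",    8.22734, 11.6337},
    {"INTER_AREA",     20.4999, 27.3197},
    {"INTER_LANCZOS4", 29.3602, 43.9484},
};

/* A ratio above 1 means the OpenCL path finished sooner than the CPU path. */
static double speedup(double cpu_seconds, double gpu_seconds) {
    return cpu_seconds / gpu_seconds;
}

/* Print the CPU/GPU time ratio for every interpolation method. */
static void print_speedups(void) {
    size_t n = sizeof(kTimings) / sizeof(kTimings[0]);
    for (size_t i = 0; i < n; ++i) {
        printf("%-15s %8.2fx\n", kTimings[i].method,
               speedup(kTimings[i].cpu_seconds, kTimings[i].gpu_seconds));
    }
}
```

Calling print_speedups() from a main function lists one ratio per method. With the numbers above, the ratio is far above 1 for INTER_NEAREST and INTER_LINEAR and below 1 for INTER_CUBIC, INTER_AREA and INTER_LANCZOS4, where the CPU path was faster.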

III. Related links

  • Building OpenCV

IV. Procedure

1. Environment setup

A. Install dependencies

mv /lib/aarch64-linux-gnu/libOpenCL.so.1 /lib/aarch64-linux-gnu/libOpenCL.so.1.bk
ln -s /usr/lib/aarch64-linux-gnu/libmali.so /lib/aarch64-linux-gnu/libOpenCL.so.1
sudo apt install opencl-headers
sudo apt install ocl-icd-libopencl1
sudo apt install ocl-icd-opencl-dev
sudo apt install clinfo

B. Set the GPU to performance mode

echo performance > /sys/class/devfreq/fb000000.gpu/governor
echo performance > /sys/class/devfreq/fdab0000.npu/governor
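
Before benchmarking it can be worth confirming that the governor change took effect. The following is a minimal sketch, assuming the governor file holds a single line of text; the function names are hypothetical and the path to check is the one used in the commands above.

```c
#include <stdio.h>
#include <string.h>

/* Reads the first line of a sysfs-style text file into buf and strips the
 * trailing newline. Returns 0 on success, -1 if the file cannot be read. */
static int read_governor(const char *path, char *buf, size_t len) {
    FILE *f = fopen(path, "r");
    if (f == NULL) return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the governor file at path currently reads "performance". */
static int is_performance_mode(const char *path) {
    char governor[64];
    if (read_governor(path, governor, sizeof(governor)) != 0) return 0;
    return strcmp(governor, "performance") == 0;
}
```

For example, is_performance_mode("/sys/class/devfreq/fb000000.gpu/governor") could be checked at the start of a benchmark program, which then refuses to run if it returns 0.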

C. Get GPU information

cat /sys/class/misc/mali0/device/gpuinfo
clinfo

Output

Mali-G610 4 cores r0p0 0xA867
Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl
  Platform Host timer resolution                  1ns
  Platform Extensions function suffix             ARM

  Platform Name                                   ARM Platform
Number of devices                                 1
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device Name                                     Mali-LODX r0p0
  Device Vendor                                   ARM
  Device Vendor ID                                0xa8670000
  Device Version                                  OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Driver Version                                  2.1
  Device OpenCL C Version                         OpenCL C 2.0 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x1024
  Max work group size                             1024
  Preferred work group size multiple              16
  Max sub-groups per work group                   64
  Preferred / native vector sizes
    char                                                16 / 4
    short                                                8 / 2
    int                                                  4 / 1
    long                                                 2 / 1
    half                                                 8 / 2        (cl_khr_fp16)
    float                                                4 / 1
    double                                               0 / 0        (n/a)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    64, Little-Endian
  Global memory size                              16643870720 (15.5GiB)
  Error Correction support                        No
  Max memory allocation                           16643870720 (15.5GiB)
  Unified memory for Host and Device              Yes
  Shared Virtual Memory (SVM) capabilities        (core)
    Coarse-grained buffer sharing                 Yes
    Fine-grained buffer sharing                   No
    Fine-grained system sharing                   No
    Atomics                                       No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Preferred alignment for atomics
    SVM                                           0 bytes
    Global                                        0 bytes
    Local                                         0 bytes
  Max size for global variable                    65536 (64KiB)
  Preferred total size of global vars             0
  Global Memory cache type                        Read/Write
  Global Memory cache size                        1048576 (1024KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   32 bytes
    Pitch alignment for 2D image buffers          64 pixels
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                64
    Max number of read/write image args           64
  Max number of pipe args                         16
  Max active pipe reservations                    1
  Max pipe packet size                            1024
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     128
  Max constant buffer size                        16643870720 (15.5GiB)
  Max size of kernel argument                     1024
  Queue properties (on host)
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Queue properties (on device)
    Out-of-order execution                        Yes
    Profiling                                     Yes
    Preferred size                                2097152 (2MiB)
    Max size                                      16777216 (16MiB)
  Max queues on device                            1
  Max events on device                            1024
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Sub-group independent forward progress        Yes
    IL version                                    SPIR-V_1.0
    SPIR versions                                 <printDeviceInfo:161: get CL_DEVICE_SPIR_VERSIONS size : error -30>
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_icd cl_khr_egl_image cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_subgroups cl_khr_subgroup_extended_types cl_khr_subgroup_non_uniform_vote cl_khr_subgroup_ballot cl_khr_il_program cl_khr_priority_hints cl_khr_create_command_queue cl_khr_spirv_no_integer_wrap_decoration cl_khr_extended_versioning cl_khr_device_uuid cl_arm_core_id cl_arm_printf cl_arm_non_uniform_work_group_size cl_arm_import_memory cl_arm_import_memory_dma_buf cl_arm_import_memory_host cl_arm_integer_dot_product_int8 cl_arm_integer_dot_product_accumulate_int8 cl_arm_integer_dot_product_accumulate_saturate_int8 cl_arm_scheduling_controls cl_arm_controlled_kernel_termination cl_ext_cxx_for_opencl

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-LODX r0p0

D. Get CPU information

lscpu

Output

Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              1
Core(s) per socket:              2
Socket(s):                       3
Vendor ID:                       ARM
Model:                           0
Model name:                      Cortex-A55
Stepping:                        r2p0
CPU max MHz:                     2208.0000
CPU min MHz:                     408.0000
BogoMIPS:                        48.00
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        1 MiB
L3 cache:                        3 MiB
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp

2. Query GPU information through the OpenCL API

cat > cl_query.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main() {
    cl_platform_id *platforms = NULL;
    cl_uint num_platforms = 0;

    // Get the number of available platforms
    cl_int clStatus = clGetPlatformIDs(0, NULL, &num_platforms);
    if (clStatus != CL_SUCCESS || num_platforms == 0) {
        printf("No OpenCL platform found (status %d)\n", clStatus);
        return 1;
    }
    platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * num_platforms);

    // Get all platform IDs
    clStatus = clGetPlatformIDs(num_platforms, platforms, NULL);
    printf("Number of OpenCL platforms: %u\n", num_platforms);

    // Iterate over the platforms
    for (cl_uint i = 0; i < num_platforms; ++i) {
        char buffer[10240];
        printf("\nPlatform %u:\n", i+1);

        // Platform name
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(buffer), buffer, NULL);
        printf("  Name: %s\n", buffer);

        // Platform vendor
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VENDOR, sizeof(buffer), buffer, NULL);
        printf("  Vendor: %s\n", buffer);

        // Platform version
        clGetPlatformInfo(platforms[i], CL_PLATFORM_VERSION, sizeof(buffer), buffer, NULL);
        printf("  Version: %s\n", buffer);

        // Number of devices
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
        cl_device_id *devices = (cl_device_id*) malloc(sizeof(cl_device_id) * num_devices);
        clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, num_devices, devices, NULL);

        // Iterate over the devices
        for (cl_uint j = 0; j < num_devices; ++j) {
            printf("  Device %u:\n", j+1);

            // Device name
            clGetDeviceInfo(devices[j], CL_DEVICE_NAME, sizeof(buffer), buffer, NULL);
            printf("    Name: %s\n", buffer);

            // Device type
            cl_device_type device_type;
            clGetDeviceInfo(devices[j], CL_DEVICE_TYPE, sizeof(device_type), &device_type, NULL);
            if (device_type & CL_DEVICE_TYPE_CPU)
                printf("    Type: CPU\n");
            if (device_type & CL_DEVICE_TYPE_GPU)
                printf("    Type: GPU\n");
            if (device_type & CL_DEVICE_TYPE_ACCELERATOR)
                printf("    Type: Accelerator\n");

            // Number of compute units
            cl_uint compute_units;
            clGetDeviceInfo(devices[j], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(compute_units), &compute_units, NULL);
            printf("    Compute units: %u\n", compute_units);

            // Global memory size
            cl_ulong global_mem;
            clGetDeviceInfo(devices[j], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(global_mem), &global_mem, NULL);
            printf("    Global memory size: %llu MB\n", (unsigned long long)(global_mem / (1024 * 1024)));
        }
        free(devices);
    }
    free(platforms);
    return 0;
}
EOF
gcc -o cl_query cl_query.c -lOpenCL
./cl_query

Output

Number of OpenCL platforms: 1

Platform 1:
  Name: ARM Platform
  Vendor: ARM
  Version: OpenCL 2.1 v1.g6p0-01eac0.ba52c908d926792b8f5fe28f383a2b03
  Device 1:
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
    Name: Mali-LODX r0p0
    Type: GPU
    Compute units: 4
    Global memory size: 15872 MB
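
The program reports 15872 MB while clinfo shows 16643870720 bytes as 15.5GiB. Both come from the same byte count; the difference is only the unit and the truncating integer division. A small sketch of the conversion, with hypothetical helper names:

```c
#include <stdint.h>

/* Whole mebibytes in a byte count, truncating like the integer division
 * global_mem / (1024 * 1024) in cl_query.c. */
static uint64_t bytes_to_mib(uint64_t bytes) {
    return bytes / (1024ULL * 1024ULL);
}

/* The same byte count expressed in gibibytes as a floating-point value. */
static double bytes_to_gib(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}
```

bytes_to_mib(16643870720ULL) gives 15872, and bytes_to_gib of the same value is just over 15.5, which matches the two outputs.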

3. Matrix multiplication with the OpenCL API

cat > matmul.c <<-'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>
#include <time.h>
#include <sys/time.h>

#define MATRIX_SIZE 8192
#define TILE_SIZE 32

// Return the current time in seconds, used to measure elapsed time
double get_current_time() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return (double)(tp.tv_sec) + (double)(tp.tv_usec) / 1e6;
}

#define xstr(s) str(s)
#define str(s) #s

const char *kernelSource = "                                  \n" \
"__kernel void mat_mul_optimized(const int N,                 \n" \
"                                __global float* A,           \n" \
"                                __global float* B,           \n" \
"                                __global float* C) {         \n" \
"    const int TILE_SIZE = " xstr(TILE_SIZE) ";               \n" \
"    __local float Asub[TILE_SIZE][TILE_SIZE];                \n" \
"    __local float Bsub[TILE_SIZE][TILE_SIZE];                \n" \
"    int global_row = get_global_id(1);                       \n" \
"    int global_col = get_global_id(0);                       \n" \
"    int local_row = get_local_id(1);                         \n" \
"    int local_col = get_local_id(0);                         \n" \
"    float sum = 0.0f;                                        \n" \
"    int numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;          \n" \
"    for (int t = 0; t < numTiles; ++t) {                     \n" \
"        int tiled_row = global_row;                          \n" \
"        int tiled_col = t * TILE_SIZE + local_col;           \n" \
"        if (tiled_row < N && tiled_col < N)                  \n" \
"            Asub[local_row][local_col] = A[tiled_row * N + tiled_col];\n" \
"        else                                                 \n" \
"            Asub[local_row][local_col] = 0.0f;               \n" \
"        tiled_row = t * TILE_SIZE + local_row;               \n" \
"        tiled_col = global_col;                              \n" \
"        if (tiled_row < N && tiled_col < N)                  \n" \
"            Bsub[local_row][local_col] = B[tiled_row * N + tiled_col];\n" \
"        else                                                 \n" \
"            Bsub[local_row][local_col] = 0.0f;               \n" \
"        barrier(CLK_LOCAL_MEM_FENCE);                        \n" \
"        for (int k = 0; k < TILE_SIZE; ++k) {                \n" \
"            sum += Asub[local_row][k] * Bsub[k][local_col];  \n" \
"        }                                                    \n" \
"        barrier(CLK_LOCAL_MEM_FENCE);                        \n" \
"    }                                                        \n" \
"    if (global_row < N && global_col < N)                    \n" \
"        C[global_row * N + global_col] = sum;                \n" \
"}                                                            \n";

int main() {
    int N = MATRIX_SIZE;
    size_t bytes = (size_t)N * N * sizeof(float);

    // Allocate host memory
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);

    // Initialize the matrices
    for(int i = 0; i < N*N; i++) {
        h_A[i] = 1.0f;
        h_B[i] = 1.0f;
    }

    // Get platform and device information
    cl_platform_id platformId = NULL;
    cl_device_id deviceID = NULL;
    cl_uint retNumDevices;
    cl_uint retNumPlatforms;
    cl_int ret = clGetPlatformIDs(1, &platformId, &retNumPlatforms);
    ret = clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1, &deviceID, &retNumDevices);

    // Create the OpenCL context
    cl_context context = clCreateContext(NULL, 1, &deviceID, NULL, NULL, &ret);

    // Create the command queue
    cl_command_queue commandQueue = clCreateCommandQueue(context, deviceID, 0, &ret);

    // Create the memory buffers
    cl_mem d_A = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_B = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, &ret);
    cl_mem d_C = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &ret);

    // Write the input data to the buffers
    ret = clEnqueueWriteBuffer(commandQueue, d_A, CL_TRUE, 0, bytes, h_A, 0, NULL, NULL);
    ret = clEnqueueWriteBuffer(commandQueue, d_B, CL_TRUE, 0, bytes, h_B, 0, NULL, NULL);

    // Record the build start time
    double compile_start = get_current_time();

    // Create the program object
    cl_program program = clCreateProgramWithSource(context, 1, (const char**)&kernelSource, NULL, &ret);

    // Build the kernel program
    ret = clBuildProgram(program, 1, &deviceID, NULL, NULL, NULL);

    // Check for build errors
    if (ret != CL_SUCCESS) {
        size_t log_size;
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = (char *)malloc(log_size);
        clGetProgramBuildInfo(program, deviceID, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        printf("CL Compilation failed:\n%s\n", log);
        free(log);
        return 1;
    }

    // Record the build end time
    double compile_end = get_current_time();
    double compile_time = compile_end - compile_start;

    // Create the OpenCL kernel
    cl_kernel kernel = clCreateKernel(program, "mat_mul_optimized", &ret);

    // Set the kernel arguments
    ret = clSetKernelArg(kernel, 0, sizeof(int), (void*)&N);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&d_A);
    ret = clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&d_B);
    ret = clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&d_C);

    // Define the global and local work sizes
    size_t local[2] = {TILE_SIZE, TILE_SIZE};
    size_t global[2] = {(size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE,
                        (size_t)((N + TILE_SIZE - 1) / TILE_SIZE) * TILE_SIZE};

    // Record the start time of the first kernel execution
    double launch_start = get_current_time();

    // Run the kernel
    ret = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, global, local, 0, NULL, NULL);
    printf("clEnqueueNDRangeKernel:%d\n",ret);

    // Wait for the command queue to finish
    clFinish(commandQueue);

    // Record the end time of the first kernel execution
    double launch_end = get_current_time();
    double launch_time = launch_end - launch_start;

    // Read back the result
    ret = clEnqueueReadBuffer(commandQueue, d_C, CL_TRUE, 0, bytes, h_C, 0, NULL, NULL);

    // Compute GFLOPS
    double total_ops = 2.0 * N * N * N;
    double gflops = (total_ops / 1e9) / launch_time;

    // Print the results
    printf("Build time: %f s\n", compile_time);
    printf("First kernel execution time: %f s\n", launch_time);
    printf("Compute performance: %f GFLOPS\n", gflops);

    // Release resources
    ret = clFlush(commandQueue);
    ret = clFinish(commandQueue);
    ret = clReleaseKernel(kernel);
    ret = clReleaseProgram(program);
    ret = clReleaseMemObject(d_A);
    ret = clReleaseMemObject(d_B);
    ret = clReleaseMemObject(d_C);
    ret = clReleaseCommandQueue(commandQueue);
    ret = clReleaseContext(context);
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}
EOF
gcc -o matmul matmul.c -lOpenCL
./matmul

Output

Build time: 0.031085 s
First kernel execution time: 62.258528 s
Compute performance: 17.660418 GFLOPS
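
matmul.c reads the result back into h_C but never checks it. Since both inputs are filled with 1.0f, every output element should equal N. The following is a hedged CPU reference that mirrors the kernel's tile loop and can be used to check the logic on a small matrix; the function names are hypothetical and this is not part of the original program.

```c
#include <stdlib.h>

/* CPU reference for C = A * B on n x n row-major matrices, walking the k
 * dimension in tiles of width `tile`, like the kernel's outer loop over t. */
static void matmul_tiled_ref(int n, int tile, const float *A, const float *B,
                             float *C) {
    int num_tiles = (n + tile - 1) / tile;   /* same rounding as numTiles */
    for (int row = 0; row < n; ++row) {
        for (int col = 0; col < n; ++col) {
            float sum = 0.0f;
            for (int t = 0; t < num_tiles; ++t) {
                for (int k = 0; k < tile; ++k) {
                    int kk = t * tile + k;
                    /* Out-of-range tile entries contribute 0, as in the kernel. */
                    if (kk < n) sum += A[row * n + kk] * B[kk * n + col];
                }
            }
            C[row * n + col] = sum;
        }
    }
}

/* Runs the reference on all-ones inputs (the initialisation used by matmul.c)
 * and returns 1 if every output element equals n, 0 otherwise. */
static int check_all_ones(int n, int tile) {
    size_t count = (size_t)n * (size_t)n;
    float *A = (float *)malloc(count * sizeof(float));
    float *B = (float *)malloc(count * sizeof(float));
    float *C = (float *)malloc(count * sizeof(float));
    int ok = (A != NULL && B != NULL && C != NULL);
    if (ok) {
        for (size_t i = 0; i < count; ++i) { A[i] = 1.0f; B[i] = 1.0f; }
        matmul_tiled_ref(n, tile, A, B, C);
        for (size_t i = 0; i < count; ++i) {
            if (C[i] != (float)n) { ok = 0; break; }
        }
    }
    free(A);
    free(B);
    free(C);
    return ok;
}
```

The same comparison, h_C[i] == (float)N for every i, could be added to matmul.c after clEnqueueReadBuffer. The reference itself is only practical for small n because it runs in O(n^3) on one core.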

4. Measure GPU performance with clpeak

git clone https://gitcode.com/gh_mirrors/cl/clpeak.git
cd clpeak
git submodule update --init --recursive --remote
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
cmake --build .
./clpeak

Output

Platform: ARM Platform
arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
  Device: Mali-LODX r0p0
    Driver version  : 2.1 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 1000 MHz

    Global memory bandwidth (GBPS)
      float   : 25.71
      float2  : 24.45
      float4  : 23.70
      float8  : 12.05
      float16 : 12.01

    Single-precision compute (GFLOPS)
      float   : 441.77
      float2  : 470.27
      float4  : 466.52
      float8  : 435.65
      float16 : 411.38

    Half-precision compute (GFLOPS)
      half   : 441.96
      half2  : 878.25
      half4  : 911.51
      half8  : 886.19
      half16 : 846.44

    No double precision support! Skipped

    Integer compute (GIOPS)
      int   : 124.96
      int2  : 125.71
      int4  : 125.16
      int8  : 123.82
      int16 : 124.24

    Integer compute Fast 24bit (GIOPS)
      int   : 125.16
      int2  : 125.63
      int4  : 125.20
      int8  : 123.73
      int16 : 124.33

    Integer char (8bit) compute (GIOPS)
      char   : 126.47
      char2  : 251.55
      char4  : 498.03
      char8  : 497.37
      char16 : 491.94

    Integer short (16bit) compute (GIOPS)
      short   : 126.31
      short2  : 250.90
      short4  : 249.47
      short8  : 248.51
      short16 : 245.30

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 8.54
      enqueueReadBuffer               : 9.97
      enqueueWriteBuffer non-blocking : 8.55
      enqueueReadBuffer non-blocking  : 9.99
      enqueueMapBuffer(for read)      : 61.66
        memcpy from mapped ptr        : 11.95
      enqueueUnmap(after write)       : 62.02
        memcpy to mapped ptr          : 11.89

    Kernel launch latency : 26.81 us
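
One way to read the compute and bandwidth figures together is a simple roofline bound: a kernel can run no faster than the compute peak, and no faster than memory bandwidth times the FLOPs it performs per byte moved. The model is a generic rule of thumb, not something clpeak reports, and the function names below are hypothetical.

```c
/* Roofline-style upper bound in GFLOPS: the smaller of the compute peak and
 * bandwidth (GB/s) times arithmetic intensity (FLOP per byte). */
static double roofline_gflops(double peak_gflops, double bandwidth_gbps,
                              double flop_per_byte) {
    double memory_bound = bandwidth_gbps * flop_per_byte;
    return memory_bound < peak_gflops ? memory_bound : peak_gflops;
}

/* Arithmetic intensity (FLOP per byte) at which the two limits meet. */
static double ridge_point(double peak_gflops, double bandwidth_gbps) {
    return peak_gflops / bandwidth_gbps;
}
```

With the scalar float figures above (441.77 GFLOPS and 25.71 GB/s), ridge_point returns roughly 17 FLOP per byte, so under this model a kernel needs at least that much data reuse before the compute peak rather than memory bandwidth becomes the limit.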

5. Measure CPU compute with OpenBLAS

git clone https://github.com/xianyi/OpenBLAS.git
cd OpenBLAS
make TARGET=ARMV8
make install
cd benchmark
make TARGET=ARMV8 sgemm
cc sgemm.o -o sgemm /opt/OpenBLAS/lib/libopenblas.so -Wl,-rpath=/opt/OpenBLAS/lib/
export OPENBLAS_NUM_THREADS=8
export OPENBLAS_LOOPS=10
export OPENBLAS_PARAM_M=8192
export OPENBLAS_PARAM_N=8192
export OPENBLAS_PARAM_K=8192
./sgemm

Output

From :   1  To : 200 Step=1 : Transa=N : Transb=N
          SIZE                   Flops             Time
 M=8192, N=8192, K=8192 :    53485.68 MFlops 205.571220 sec
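
The reported rate can be cross-checked from the matrix size and the elapsed time. The sketch below assumes the printed time covers all OPENBLAS_LOOPS=10 multiplications, which the arithmetic supports; the function names are hypothetical.

```c
/* FLOPs of one n x n x n matrix multiply: n^3 multiplies plus n^3 adds. */
static double gemm_flops(double n) {
    return 2.0 * n * n * n;
}

/* Average GFLOPS over `loops` multiplies that took `seconds` in total. */
static double gemm_gflops(double n, int loops, double seconds) {
    return gemm_flops(n) * loops / seconds / 1e9;
}
```

gemm_gflops(8192, 10, 205.571220) comes out at about 53.49 GFLOPS, in line with the 53485.68 MFlops printed by sgemm. The same formula with one loop and 62.258528 s reproduces the 17.66 GFLOPS of the hand-written OpenCL kernel in step 3.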

6. Benchmark OpenCV resize on the CPU and with OpenCL

A. Build OpenCV with OpenCL support

  • OpenCV change (link against libmali.so)
diff --git a/cmake/OpenCVDetectOpenCL.cmake b/cmake/OpenCVDetectOpenCL.cmake
index 6ab2cae070..c3cf235e45 100644
--- a/cmake/OpenCVDetectOpenCL.cmake
+++ b/cmake/OpenCVDetectOpenCL.cmake
@@ -3,9 +3,8 @@ if(APPLE)
   set(OPENCL_LIBRARY "-framework OpenCL" CACHE STRING "OpenCL library")
   set(OPENCL_INCLUDE_DIR "" CACHE PATH "OpenCL include directory")
 else()
-  set(OPENCL_LIBRARY "" CACHE STRING "OpenCL library")
-  set(OPENCL_INCLUDE_DIR "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/1.2" CACHE PATH "OpenCL include directory")
-  ocv_install_3rdparty_licenses(opencl-headers "${OpenCV_SOURCE_DIR}/3rdparty/include/opencl/LICENSE.txt")
+  set(OPENCL_LIBRARY "/usr/lib/aarch64-linux-gnu/libmali.so")
+  set(OPENCL_INCLUDE_DIR "/usr/include")
 endif()
 mark_as_advanced(OPENCL_INCLUDE_DIR OPENCL_LIBRARY)
  • Build OpenCV
git clone https://github.com/opencv/opencv.git
cd opencv
git checkout bdb6a968ce69a2bf7c34724f9052c20e941ab47b
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=`pwd`/_install \
    -DWITH_OPENCL=ON -DWITH_NEON=ON \
    -DBUILD_SHARED_LIBS=ON \
    -D BUILD_opencv_world=ON -DBUILD_TESTS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_opencv_apps=OFF \
    -DBUILD_opencv_dnn=OFF -DBUILD_opencv_calib3d=OFF \
    -DBUILD_opencv_imgproc=ON -DBUILD_opencv_imgcodecs=ON ..
make -j4
make install

B. Run the OpenCV test program

cat > opencv_resize.cpp <<-'EOF'
#include <opencv2/opencv.hpp>
#include <opencv2/core/ocl.hpp>
#include <iostream>
#include <string>
#include <vector>

void run(int resize_mode)
{
    // Create a 32x32 random image
    cv::Mat src = cv::Mat::zeros(32, 32, CV_8UC3);
    cv::randu(src, cv::Scalar::all(0), cv::Scalar::all(255));

    // ------------------------------------
    // Run on the CPU
    // ------------------------------------
    cv::ocl::setUseOpenCL(false);
    cv::Mat enlarged_cpu, resized_back_cpu;

    // Record the start time of the CPU loop
    int64 start_time_cpu = cv::getTickCount();
    for(int i=0;i<100;i++)
    {
        // Enlarge to 8192x8192
        cv::resize(src, enlarged_cpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Shrink back to 32x32
        cv::resize(enlarged_cpu, resized_back_cpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // Record the end time of the CPU loop
    int64 end_time_cpu = cv::getTickCount();

    // Total CPU time for the 100 enlarge/shrink rounds
    double time_resize_cpu = (end_time_cpu - start_time_cpu) / cv::getTickFrequency();

    // ------------------------------------
    // Run on the GPU (OpenCL)
    // ------------------------------------
    cv::ocl::setUseOpenCL(true);
    cv::UMat src_umat;
    src.copyTo(src_umat);
    cv::UMat enlarged_gpu, resized_back_gpu;

    // Record the start time of the GPU loop
    int64 start_time_gpu = cv::getTickCount();
    for(int i=0;i<100;i++)
    {
        // Enlarge to 8192x8192
        cv::resize(src_umat, enlarged_gpu, cv::Size(8192, 8192), 0, 0, resize_mode);
        // Shrink back to 32x32
        cv::resize(enlarged_gpu, resized_back_gpu, cv::Size(32, 32), 0, 0, resize_mode);
    }
    // Record the end time of the GPU loop
    int64 end_time_gpu = cv::getTickCount();

    // Total GPU time for the 100 enlarge/shrink rounds
    double time_resize_gpu = (end_time_gpu - start_time_gpu) / cv::getTickFrequency();
    std::cout << "CPU time (s):" << time_resize_cpu << " " << "GPU time (s):" << time_resize_gpu << std::endl;
}

int main() {
    // Check whether the system supports OpenCL
    if (!cv::ocl::haveOpenCL()) {
        std::cout << "OpenCL is not supported on this system." << std::endl;
        return -1;
    }

    // Print the OpenCL device information
    cv::ocl::Context context;
    if (!context.create(cv::ocl::Device::TYPE_GPU)) {
        std::cout << "No usable GPU device found; running on the CPU." << std::endl;
    } else {
        cv::ocl::Device device = cv::ocl::Device::getDefault();
        std::cout << "OpenCL device in use: " << device.name() << std::endl;
    }

    // Interpolation methods to test
    std::vector<int> interpolation_methods = {
        cv::INTER_NEAREST,
        cv::INTER_LINEAR,
        cv::INTER_CUBIC,
        cv::INTER_AREA,
        cv::INTER_LANCZOS4
    };

    // Names of the interpolation methods, used in the output
    std::vector<std::string> interpolation_names = {
        "INTER_NEAREST",
        "INTER_LINEAR",
        "INTER_CUBIC",
        "INTER_AREA",
        "INTER_LANCZOS4"
    };

    for (size_t i = 0; i < interpolation_methods.size(); ++i) {
        int interpolation = interpolation_methods[i];
        std::string method_name = interpolation_names[i];
        std::cout << "Interpolation: " << method_name << " ";
        run(interpolation);
    }
    return 0;
}
EOF
g++ -o opencv_resize opencv_resize.cpp -I _install/include/opencv4 \
    _install/lib/libopencv_world.so -Wl,-rpath=_install/lib
export OPENBLAS_NUM_THREADS=8
./opencv_resize

Output

arm_release_ver of this libmali is 'g6p0-01eac0', rk_so_ver is '6'.
OpenCL device in use: Mali-LODX r0p0
Interpolation: INTER_NEAREST  CPU time (s): 3.01526 GPU time (s): 0.0672681
Interpolation: INTER_LINEAR   CPU time (s): 5.3227  GPU time (s): 0.0189366
Interpolation: INTER_CUBIC    CPU time (s): 8.22734 GPU time (s): 11.6337
Interpolation: INTER_AREA     CPU time (s): 20.4999 GPU time (s): 27.3197
Interpolation: INTER_LANCZOS4 CPU time (s): 29.3602 GPU time (s): 43.9484
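
A quick plausibility check on the two fastest GPU timings is to compute the write bandwidth they would imply. The sketch below counts only the bytes of the enlarged 8192x8192 three-channel image and uses hypothetical function names; it is a back-of-the-envelope estimate, not a measurement.

```c
/* Bytes in one width x height image with `channels` 8-bit channels. */
static double image_bytes(int width, int height, int channels) {
    return (double)width * (double)height * (double)channels;
}

/* Write bandwidth in GB/s implied by producing `iterations` enlarged images
 * in `seconds`. Only the enlarged output is counted, so this is a lower
 * bound on the memory traffic a completed run would need. */
static double implied_write_gbps(int width, int height, int channels,
                                 int iterations, double seconds) {
    return image_bytes(width, height, channels) * iterations / seconds / 1e9;
}
```

implied_write_gbps(8192, 8192, 3, 100, 0.0189366) is above 1000 GB/s for INTER_LINEAR, and the INTER_NEAREST timing also implies far more than the 25.71 GB/s global memory bandwidth that clpeak measured. One possible explanation, not verified here, is that the OpenCL kernels were still queued when the timer stopped; calling cv::ocl::finish() before reading the end time would test that.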
