CUDA from Beginner to Expert (10): Performance Profiling and the Visual Profiler

Once you are past the basics, the next thing to learn is how to optimize your own code. The earlier examples ignored performance entirely so that we could concentrate on the fundamental concepts rather than implementation details. Starting with this section we will think in terms of performance and keep refining the code, because making it run faster is, after all, the whole point of parallel processing.

There are many ways to measure how fast code runs. On the host side you can call a system-time API such as clock() in C (or GetSystemTime() on Windows) and take the difference between two time points to obtain an elapsed time. CUDA, however, provides dedicated APIs for measuring execution time on the device, introduced one by one below.

Keep the programming manual, the CUDA_Toolkit_Reference_Manual, at hand so you can look up any API you are unsure about. Around the kernel launch we do the following:

cudaEvent_t start, stop;                                      // event objects
cudaEventCreate(&start);                                      // create the start event
cudaEventCreate(&stop);                                       // create the stop event
cudaEventRecord(start, stream);                               // record the start event
myKernel<<<dimg, dimb, size_smem, stream>>>(parameter list);  // launch the kernel
cudaEventRecord(stop, stream);                                // record the stop event
cudaEventSynchronize(stop);                                   // wait until all device work issued before the stop event has finished
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);              // elapsed time between the two events, in ms

The kernel execution time is stored in the variable elapsedTime, in milliseconds. From this value we can judge the performance of the algorithm. The following example shows how to use this timing facility.
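One detail worth adding: the events are device resources and should be released once the measurement is done. Below is a minimal sketch of the same pattern with cleanup; myKernel, grid, and block are placeholders rather than names from this article, and passing 0 as the stream argument records the events on the default stream.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                 // 0 = default stream
myKernel<<<grid, block>>>(/* args */);     // placeholder kernel launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                // wait for the stop event to complete

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds

cudaEventDestroy(start);                   // release the event objects
cudaEventDestroy(stop);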

The earlier example was tiny: only 5 elements, far too little work to time meaningfully. Below we enlarge the problem to 1024 elements and, in addition, run the computation 1000 times and measure the total time, so the estimate is less affected by random fluctuations. We use this example to compare the performance of thread parallelism against block parallelism. The code is as follows:

#include "cuda_runtime.h"#include "device_launch_parameters.h"#include <stdio.h>cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size);__global__ void addKernel_blk(int *c, const int *a, const int *b){int i = blockIdx.x;c[i] = a[i]+ b[i];}__global__ void addKernel_thd(int *c, const int *a, const int *b){int i = threadIdx.x;c[i] = a[i]+ b[i];}int main(){const int arraySize = 1024;int a[arraySize] = {0};int b[arraySize] = {0};for(int i = 0;i<arraySize;i++){a[i] = i;b[i] = arraySize-i;}int c[arraySize] = {0};// Add vectors in parallel.cudaError_t cudaStatus;int num = 0;cudaDeviceProp prop;cudaStatus = cudaGetDeviceCount(&num);for(int i = 0;i<num;i++){cudaGetDeviceProperties(&prop,i);}cudaStatus = addWithCuda(c, a, b, arraySize);if (cudaStatus != cudaSuccess) {fprintf(stderr, "addWithCuda failed!");return 1;}// cudaThreadExit must be called before exiting in order for profiling and// tracing tools such as Nsight and Visual Profiler to show complete traces.cudaStatus = cudaThreadExit();if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaThreadExit failed!");return 1;}for(int i = 0;i<arraySize;i++){if(c[i] != (a[i]+b[i])){printf("Error in %d\n",i);}}return 0;}// Helper function for using CUDA to add vectors in parallel.cudaError_t addWithCuda(int *c, const int *a, const int *b, size_t size){int *dev_a = 0;int *dev_b = 0;int *dev_c = 0;cudaError_t cudaStatus;// Choose which GPU to run on, change this on a multi-GPU system.cudaStatus = cudaSetDevice(0);if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");goto Error;}// Allocate GPU buffers for three vectors (two input, one output) .cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaMalloc failed!");goto Error;}cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaMalloc failed!");goto Error;}cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaMalloc failed!");goto Error;}// Copy input vectors from host memory to GPU buffers.cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaMemcpy failed!");goto Error;}cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaMemcpy failed!");goto Error;}cudaEvent_t start,stop;cudaEventCreate(&start);cudaEventCreate(&stop);cudaEventRecord(start,0);for(int i = 0;i<1000;i++){//addKernel_blk<<<size,1>>>(dev_c, dev_a, dev_b);addKernel_thd<<<1,size>>>(dev_c, dev_a, dev_b);}cudaEventRecord(stop,0);cudaEventSynchronize(stop);float tm;cudaEventElapsedTime(&tm,start,stop);printf("GPU Elapsed time:%.6f ms.\n",tm);// cudaThreadSynchronize waits for the kernel to finish, and returns// any errors encountered during the launch.cudaStatus = cudaThreadSynchronize();if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaThreadSynchronize returned error code %d after launching addKernel!\n", cudaStatus);goto Error;}// Copy output vector from GPU buffer to host memory.cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);if (cudaStatus != cudaSuccess) {fprintf(stderr, "cudaMemcpy failed!");goto Error;}Error:cudaFree(dev_c);cudaFree(dev_a);cudaFree(dev_b);return cudaStatus;}

addKernel_blk implements the vector addition with block parallelism, while addKernel_thd implements it with thread parallelism. Running each version separately gives the results shown in the figures below:

Thread parallelism: (screenshot of the measured GPU elapsed time; not reproduced here)
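The listing above measures one version at a time, switching between them by commenting out one of the launches. As a variation, both kernels can be timed in a single run so the two numbers come from the same session; the sketch below assumes dev_a, dev_b, dev_c and size are set up exactly as in addWithCuda.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float tBlk = 0.0f, tThd = 0.0f;

// Block parallelism: 'size' blocks of one thread each.
cudaEventRecord(start, 0);
for (int i = 0; i < 1000; i++)
    addKernel_blk<<<size, 1>>>(dev_c, dev_a, dev_b);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&tBlk, start, stop);

// Thread parallelism: one block of 'size' threads.
cudaEventRecord(start, 0);
for (int i = 0; i < 1000; i++)
    addKernel_thd<<<1, size>>>(dev_c, dev_a, dev_b);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&tThd, start, stop);

printf("block-parallel: %.6f ms, thread-parallel: %.6f ms\n", tBlk, tThd);

cudaEventDestroy(start);
cudaEventDestroy(stop);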
