CUDA 矩阵转置优化

打算开个新坑来填一下 GPU 上常用的优化技巧和算法，和 CPU 上不一样，GPU 上由于体系结构的特殊性导致算法的设计必须要考虑到硬件构造，即便是同样复杂度的算法，不同的实现之间通常会差几十甚至上百倍的时间。网络上其实已经有很多关于 CUDA 优化的资料，但我可能更看重于探索不同的做法。作为这个系列的开始，就拿最简单的矩阵转置来一探究竟吧！

问题描述

给定一个以行优先方式存储的 $M \times N$ 的矩阵 $A$ （即 $A_{i, j}$ 存储在 $A_\text{mem}[i \times N + j]$ 上），需要得到一个同样是以行优先存储的 $N \times M$ 的转置矩阵 $A^\intercal$ （即 $A^\intercal_{i, j}$ 存储在 $A^{\intercal}_{\text{mem}}[i \times M + j]$ 上）。

评测环境与测试结果

下面用于评测每个 Kernel 时间的测评环境是这样的：

CPU: AMD Ryzen 9 3900x
GPU: RTX 2080 Ti (11 GiB) @ 250W
OS: Arch Linux
GPU Driver: 545.29.02
GCC/CUDA: 13.2.1/12.3.0

所有实验均选取 $M = N = 16384$ 作为输入矩阵大小，并采用 3 轮热身，评测 100 轮取平均值和标准差的方式得到运行时间。下表展示了所有 Kernel 的测试结果，仅供娱乐😝：

实现	线程块形状 $(x, y)$	运行时间 (ms)	相对效率¹
内存拷贝²	-	4.062 ± 0.041	100%
cuBLAS 转置	-	4.150 ± 0.049	97.9%
实现一：暴力	(32, 8)	9.379 ± 0.121	43.3%
实现二：使用 Shared Memory 优化访存	(32, 8)	5.300 ± 0.107	76.6%
实现三：减少 Bank Conflicts	(32, 8)	4.986 ± 0.035	81.5%
实现四：使用 Double-Buffering 技术优化	(32, 32)	4.205 ± 0.058	96.6%
实现五：增大矩阵分块的大小以减少同步数量	(32, 32)	4.089 ± 0.064	99.3%

实现一：暴力

最暴力的方式就是直接让开 $N \times M$ 个线程，然后让线程 $(x, y)$ 负责对 $A^\intercal_{y, x}$ 找到其对应的值 $A_{x, y}$ ，代码如下：

transpose_v1
template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void transpose_v1(int M, int N,                  //
                             const float *__restrict__ iA,  //
                             float *__restrict__ oA) {
  const int y = blockIdx.y * BLOCK_DIM_Y + threadIdx.y;
  const int x = blockIdx.x * BLOCK_DIM_X + threadIdx.x;
  if (y >= N || x >= M) {
    return;
  }
  oA[y * M + x] = iA[x * N + y];
}

其中 BLOCK_DIM_X 和 BLOCK_DIM_Y 表示线程块的形状。测试结果显示这个 Kernel 的效率不到内存拷贝的一半，说明是很有提速空间的，造成这个现象的主要原因是写入 Global Memory 时内存访问不连续，例如：

线程 (0, 0) 读取 iA[0 * N + 0] 写入 oA[0 * M + 0]
线程 (1, 0) 读取 iA[1 * N + 0] 写入 oA[0 * M + 1]
线程 (2, 0) 读取 iA[2 * N + 0] 写入 oA[0 * M + 2]
……

可以看到相邻线程间读取的 iA 地址的步长是 $N$ ，因此只要是 $N$ 足够大，上面这些内存访问就很难被合并（Coalesced）也很难被缓存 Cache 住。另外一种实现是连续的线程读取连续的 iA，但这样会造成写入 oA 的内存访问不连续，因此也会造成访存不连续的问题。

实现二：使用 Shared Memory 优化访存

通过上述分析不难看出优化的关键在与如何能合并写回到 Global Memory 的访存，这时候就要搬出 GPU 上一块比较好玩的硬件——Shared Memory，Shared Memory 并没有类似于 Global Memory 中关于合并内存访问的优化，并且 Shared Memory 比 Global Memory 快几个数量级，因此将 Shared Memory 用于处理这种改变数据在内存中排列的问题就尤为合适。

具体而言，我们先将整个输出矩阵按照 $T_y \times T_x$ 的大小分块，其中 $T_x$ 和 $T_y$ 分别表示每个块的长和宽，然后每个 CUDA 线程块负责处理一个分块，其算法分为三步：

线程块以内存友好的形式读入 iA 到 Shared Memory 中；
同步线程块内的所有线程；
线程块以内存友好的形式从 Shared Memory 中写入到 oA 中；

正如前面提到的，Shared Memory 的访问并不关心其访问是否连续（而只关心是否有 Bank Conflict，见下一节），因此我们可以在保证对 Global Memory 上的 iA 或者 oA 的内存访问是友好的（即可以被合并）。我的实现里选定 $T_x = T_y = 32$ （因为 32 个 float32 刚好 128 字节刚好是最大的能合并的内存访问大小）通过上述步骤得到的代码如下：

transpose_v2_32x32template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void transpose_v2_32x32(int M, int N,                  //
                                   const float *__restrict__ iA,  //
                                   float *__restrict__ oA) {
  __shared__ float smem[32][32];
  // Global Memory -> Shared Memory
  {
    constexpr const int ITER_Y = 32 / BLOCK_DIM_Y;
    constexpr const int ITER_X = 32 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 32);
    static_assert(ITER_X * BLOCK_DIM_X == 32);

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = blockIdx.x * 32 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = blockIdx.y * 32 + lx;
        if (gy < M && gx < N) {
          smem[lx][ly] = iA[gy * N + gx];
        }
      }
    }
  }
  __syncthreads();

  // Shared Memory -> Global Memory
  {
    constexpr const auto ITER_Y = 32 / BLOCK_DIM_Y;
    constexpr const auto ITER_X = 32 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 32);
    static_assert(ITER_X * BLOCK_DIM_X == 32);

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = blockIdx.y * 32 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = blockIdx.x * 32 + lx;
        if (gy < N && gx < M) {
          oA[gy * M + gx] = smem[ly][lx];
        }
      }
    }
  }
}

测试结果中展示了这种方法的效果，在采用相同线程块大小 (32, 8) 的情况下，比实现一快了不少，但是距离 cuBLAS 仍然还有距离。用 Nsight Compute 跑一下发现果然是 Bank Conflict 在作怪，因此我们要需要好好考虑 Shared Memory 中的排列以减少 Bank Conflict。

实现三：优化 Bank Conflicts

Shared Memory 被划分成了 32 个 Bank，如果线程块中同半个 Warp 中的不同线程访问了 Shared Memory 同一个 Bank 中的不同内存，那么就会导致 Bank Conflict。这通常是造成 Shared Memory 性能减弱的一个关键因素，然而不幸的是在读取 iA 时我们刚好需要让一个 Warp 访问 Shared Memory 中的一列🫤。

大概许多人优化 Bank Conflicts 的方式都是在 Shared Memory 后面补一维，即虽然我们用到的是 32 × 32 的共享内存，但是我们声明的时候声明成 32 × 33，这样同一列的不同元素就会分布在不同 Bank 上。但这种方式让强qiǎng迫症的我很是不爽，这里其实并不需要 Padding，可以采用一种类似于半精度矩阵乘法中 Shared Memory 的数据排列的方法来存储数据，代码如下：

transpose_v3_32x32template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void transpose_v3_32x32(int M, int N,                  //
                                   const float *__restrict__ iA,  //
                                   float *__restrict__ oA) {
  __shared__ float smem[32][32];
  // Global Memory -> Shared Memory
  {
    constexpr const int ITER_Y = 32 / BLOCK_DIM_Y;
    constexpr const int ITER_X = 32 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 32);
    static_assert(ITER_X * BLOCK_DIM_X == 32);

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = blockIdx.x * 32 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = blockIdx.y * 32 + lx;
        if (gy < M && gx < N) {
          smem[lx][(lx + ly) % 32] = iA[gy * N + gx];
        }
      }
    }
  }
  __syncthreads();

  // Shared Memory -> Global Memory
  {
    constexpr const auto ITER_Y = 32 / BLOCK_DIM_Y;
    constexpr const auto ITER_X = 32 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 32);
    static_assert(ITER_X * BLOCK_DIM_X == 32);

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = blockIdx.y * 32 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = blockIdx.x * 32 + lx;
        if (gy < N && gx < M) {
          oA[gy * M + gx] = smem[ly][(lx + ly) % 32];
        }
      }
    }
  }
}

代码中的高亮行是和实现二最大的区别，通过 Nsight Compute 跑出来证实这样可以完美避免所有 Bank Conflict，同时也不需要 Padding，可以说是一石二鸟。

测试结果表明在线程块大小相同的情况下，这种方法确实带来了一定的性能提升但不多，也达不到 cuBLAS 和直接进行内存拷贝的效率。从这里开始其实用 Nsight Compute 也很难一眼看出性能瓶颈在哪，例如 SASS 汇编中发现造成 Warp Stall 最主要的原因是 Long Scoreboard，大致应该是和全局内存读写有关？然而从这个角度出发，除了向量化内存读写或许就没有什么好办法了，但向量化内存读写会改变每个线程做的事情，可能会引入新的 Shared Memory Bank Conflict。大致观察了一下前人的经验，得出的结论大致是每个线程块同步太多但做事太少而造成的。而需要优化这个大致有两种思路，分别对应实现四和实现五，在相同共享内存大小的前提下，这两种方法是互斥的，因此不能同时使用。

实现四：使用 Double-Buffering 技术优化

一种很容易直接想到的办法是让一个线程块处理多个矩阵分块，但直接这样做处理多少个矩阵分块就需要多少次同步，因此直观上对性能的影响应该不大。为了减少同步可以采用一种叫做 Double-Buffering 的技术进行异步的内存读写，具体而言就是开两块 Shared Memory 在从第一块 Shared Memory 写入当前分块到 Global Memory 时，同时读取下一个分块到第二块 Shared Memory 中，代码如下：

transpose_v4_32x32template <int ITER_BLOCK_X, int ITER_BLOCK_Y,  //
          int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void transpose_v4_32x32(int M, int N,                  //
                                   const float *__restrict__ iA,  //
                                   float *__restrict__ oA) {
  __shared__ float smem[2][32][32];

  auto MoveG2S = [&](int i) {
    constexpr const int ITER_Y = 32 / BLOCK_DIM_Y;
    constexpr const int ITER_X = 32 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 32);
    static_assert(ITER_X * BLOCK_DIM_X == 32);

    const int by = i / ITER_BLOCK_X * gridDim.y + blockIdx.y;
    const int bx = i % ITER_BLOCK_X * gridDim.x + blockIdx.x;

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = bx * 32 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = by * 32 + lx;
        if (gy < M && gx < N) {
          smem[i & 1][lx][(lx + ly) % 32] = iA[gy * N + gx];
        }
      }
    }
  };

  auto MoveS2G = [&](int i) {
    constexpr const auto ITER_Y = 32 / BLOCK_DIM_Y;
    constexpr const auto ITER_X = 32 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 32);
    static_assert(ITER_X * BLOCK_DIM_X == 32);

    const int by = i / ITER_BLOCK_X * gridDim.y + blockIdx.y;
    const int bx = i % ITER_BLOCK_X * gridDim.x + blockIdx.x;

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = by * 32 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = bx * 32 + lx;
        if (gy < N && gx < M) {
          oA[gy * M + gx] = smem[i & 1][ly][(lx + ly) % 32];
        }
      }
    }
  };

  MoveG2S(0);
#pragma unroll
  for (int i = 0; i < ITER_BLOCK_Y * ITER_BLOCK_X; i++) {
    __syncthreads();
    if (i + 1 < ITER_BLOCK_Y * ITER_BLOCK_X) {
      MoveG2S(i + 1);
    }
    MoveS2G(i);
  }
}

测试结果表明和最优的内存拷贝只差了一点点，已经非常优秀了。

实现五：增大矩阵分块的大小以减少同步数量

因为每个矩阵分块只需要同步一次，因此另一种减少线程块同步的方法是让每个线程块处理 64 × 32 大小的矩阵分块，具体代码如下：

transpose_v5_64x32template <int BLOCK_DIM_X, int BLOCK_DIM_Y>
__global__ void transpose_v5_64x32(int M, int N,                  //
                                   const float *__restrict__ iA,  //
                                   float *__restrict__ oA) {
  __shared__ float smem[64][32];
  // Global Memory -> Shared Memory
  {
    constexpr const int ITER_Y = 32 / BLOCK_DIM_Y;
    constexpr const int ITER_X = 64 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 32);
    static_assert(ITER_X * BLOCK_DIM_X == 64);

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = blockIdx.x * 32 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = blockIdx.y * 64 + lx;
        if (gy < M && gx < N) {
          smem[lx][(lx + ly) % 32] = iA[gy * N + gx];
        }
      }
    }
  }
  __syncthreads();

  // Shared Memory -> Global Memory
  {
    constexpr const auto ITER_Y = 64 / BLOCK_DIM_Y;
    constexpr const auto ITER_X = 32 / BLOCK_DIM_X;
    static_assert(ITER_Y * BLOCK_DIM_Y == 64);
    static_assert(ITER_X * BLOCK_DIM_X == 32);

#pragma unroll
    for (int iy = 0; iy < ITER_Y; iy++) {
      const int ly = iy * BLOCK_DIM_Y + threadIdx.x / BLOCK_DIM_X;
      const int gy = blockIdx.y * 64 + ly;
#pragma unroll
      for (int ix = 0; ix < ITER_X; ix++) {
        const int lx = ix * BLOCK_DIM_X + threadIdx.x % BLOCK_DIM_X;
        const int gx = blockIdx.x * 32 + lx;
        if (gy < N && gx < M) {
          oA[gy * M + gx] = smem[ly][(lx + ly) % 32];
        }
      }
    }
  }
}

测试结果表明这种方式能基本达到内存拷贝性能，也能超过 cuBLAS。

总结

矩阵转置的优化可以说是 CUDA 优化中基础中的基础了，然而尽管如此我还是花了好几天来学习和测试各种不同的实现。但是翻遍网络上的所有资料，大多只告诉你一种可行方法而并不会列举出其他的可能性。另外就是个人感觉 CUDA 代码的优化非常吃经验，很多时候就只能靠不断的做性能测试才能知道哪些决策是正确的。

参考

相对效率的计算方式为内存拷贝的运行时间除以当前 Kernel 运行时间 ↩︎
即直接将 $A_\text{mem}$ 的内存拷贝到 $A_{\text{mem}}^\intercal$ ↩︎