Test Environment

  • GPU: NVIDIA Tesla A100
  • CUDA Version: 12.6

Results

[M, N, K]           [kTM, kTN, kTK]  WarpLayout  kRK  CUTLASS (ms)  TileFusion (ms)
[1024, 1024, 512]   [64, 128, 128]   [2, 2]      16   0.017591      0.016548
[1024, 1024, 1024]  [64, 128, 128]   [2, 2]      16   0.029245      0.027156
[2048, 2048, 1024]  [64, 128, 128]   [2, 2]      16   0.065372      0.070431
[2048, 2048, 2048]  [64, 128, 128]   [2, 2]      16   0.101253      0.128143
[4096, 4096, 4096]  [64, 128, 128]   [2, 2]      16   0.818606      0.969605
[8192, 8192, 1024]  [64, 128, 128]   [2, 2]      16   0.871526      0.971059
[8192, 8192, 2048]  [64, 128, 128]   [2, 2]      16   1.937879      1.931223
[8192, 8192, 4096]  [64, 128, 128]   [2, 2]      16   3.924275      3.956757
[8192, 8192, 8192]  [64, 128, 128]   [2, 2]      16   7.740396      8.080589
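
The raw timings are easier to compare once converted to throughput and to a CUTLASS/TileFusion time ratio. The sketch below does that conversion for the rows in the table. It assumes the usual 2 * M * N * K floating-point operation count for a GEMM; the source does not state the data type or how the kernels were timed, so the derived numbers are only as meaningful as the timings themselves. The names `gemm_tflops` and `relative_speed` are hypothetical helpers, not part of CUTLASS or TileFusion.

```python
# Timings copied from the results table: (M, N, K, CUTLASS ms, TileFusion ms).
ROWS = [
    (1024, 1024, 512, 0.017591, 0.016548),
    (1024, 1024, 1024, 0.029245, 0.027156),
    (2048, 2048, 1024, 0.065372, 0.070431),
    (2048, 2048, 2048, 0.101253, 0.128143),
    (4096, 4096, 4096, 0.818606, 0.969605),
    (8192, 8192, 1024, 0.871526, 0.971059),
    (8192, 8192, 2048, 1.937879, 1.931223),
    (8192, 8192, 4096, 3.924275, 3.956757),
    (8192, 8192, 8192, 7.740396, 8.080589),
]


def gemm_tflops(m: int, n: int, k: int, elapsed_ms: float) -> float:
    """Throughput in TFLOP/s, assuming 2*M*N*K operations per GEMM."""
    flops = 2.0 * m * n * k
    seconds = elapsed_ms * 1e-3
    return flops / seconds / 1e12


def relative_speed(cutlass_ms: float, tilefusion_ms: float) -> float:
    """CUTLASS time divided by TileFusion time; > 1 means TileFusion is faster."""
    return cutlass_ms / tilefusion_ms


for m, n, k, cutlass_ms, tilefusion_ms in ROWS:
    print(
        f"[{m}, {n}, {k}]: "
        f"CUTLASS {gemm_tflops(m, n, k, cutlass_ms):.1f} TFLOP/s, "
        f"TileFusion {gemm_tflops(m, n, k, tilefusion_ms):.1f} TFLOP/s, "
        f"ratio {relative_speed(cutlass_ms, tilefusion_ms):.3f}"
    )
```

A ratio above 1 marks a shape where TileFusion finished sooner than CUTLASS; a ratio below 1 marks the opposite.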