This preliminary test measures the cost of moving a row-major tile of half-precision floating-point values between global memory and shared memory. Each cycle loads the tile from global memory into shared memory and then stores it back to global memory. The cycle is repeated 100 times.

The reported metric is the total time required to complete all 100 load-and-store cycles.
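
To relate a measured total time to the amount of data moved, the arithmetic can be sketched as follows. This is a minimal sketch and not part of the benchmark itself. It assumes that each cycle reads the tile from global memory once and writes it back once, that a half-precision element occupies 2 bytes, and that the reported time covers all 100 cycles. The helper names `tile_bytes` and `effective_bandwidth_gbs` are hypothetical.

```python
HALF_BYTES = 2      # size of one half-precision (fp16) element
NUM_CYCLES = 100    # load + store cycles per measurement, per the test description


def tile_bytes(rows, cols):
    """Bytes occupied by a rows x cols tile of half-precision values."""
    return rows * cols * HALF_BYTES


def effective_bandwidth_gbs(rows, cols, total_ms):
    """Global-memory traffic per second in GB/s (1 GB = 1e9 bytes).

    Assumption: every cycle moves the tile twice through global memory,
    once for the load and once for the store.
    """
    moved_bytes = 2 * NUM_CYCLES * tile_bytes(rows, cols)
    return moved_bytes / (total_ms * 1e-3) / 1e9
```

For example, `tile_bytes(128, 256)` returns 65536, so under these assumptions one measurement of that shape moves 2 × 100 × 65536 bytes through global memory.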

Implementations

The test compares a TileFusion implementation with a cutlass implementation. No shared-memory bank conflicts were observed for either when profiled with NVIDIA Nsight Compute. The cutlass implementation uses a copy plan that maximizes global memory coalescing.
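
As an illustration of what coalescing means for a row-major half-precision tile, the sketch below models which byte offsets the threads of one warp would touch. It is a host-side model in Python, not the cutlass or TileFusion copy plan. The 16-byte (128-bit) access per thread, the 128-byte cache-line granularity, and all function names are assumptions made for illustration.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
VECTOR_BYTES = 16   # assumed 128-bit access per thread
HALF_BYTES = 2      # size of one half-precision element
LINE_BYTES = 128    # assumed cache-line granularity for counting transactions


def lane_coords(lane, cols):
    """(row, col) of the first element handled by a lane in a row-major tile."""
    chunks_per_row = cols * HALF_BYTES // VECTOR_BYTES  # 16-byte chunks per row
    row = lane // chunks_per_row
    col = (lane % chunks_per_row) * (VECTOR_BYTES // HALF_BYTES)
    return row, col


def lane_byte_offset(lane, cols, ld=None):
    """Byte offset of a lane's access; ld is the row stride in elements."""
    ld = cols if ld is None else ld
    row, col = lane_coords(lane, cols)
    return (row * ld + col) * HALF_BYTES


def lines_touched(cols, ld=None):
    """Number of distinct 128-byte lines one warp touches in a single access."""
    return len({lane_byte_offset(lane, cols, ld) // LINE_BYTES
                for lane in range(WARP_SIZE)})
```

With a 64-column tile stored contiguously, `lines_touched(64)` returns 4, which is the minimum for the 512 bytes a warp moves per access in this model. With a hypothetical padded row stride of 72 elements, `lines_touched(64, ld=72)` returns 5, because the rows no longer start on line boundaries.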

Test Environment

  • GPU: NVIDIA A100
  • CUDA Version: 12.6

Results

| Shape | Warp Layout | tilefusion (ms) | cutlass (ms) | Ratio (tilefusion / cutlass) |
|---|---|---|---|---|
| RowMajor(16, 64) | (1, 1) | 0.02996 | 0.02957 | 1.013 |
| RowMajor(64, 64) | (1, 1) | 0.05073 | 0.05071 | 1 |
| RowMajor(64, 64) | (2, 1) | 0.05045 | 0.05068 | 0.9956 |
| RowMajor(64, 64) | (4, 1) | 0.05119 | 0.05145 | 0.995 |
| RowMajor(128, 128) | (1, 1) | 0.1369 | 0.154 | 0.8888 |
| RowMajor(128, 128) | (2, 2) | 0.1374 | 0.134 | 1.025 |
| RowMajor(128, 128) | (4, 2) | 0.138 | 0.1382 | 0.9984 |
| RowMajor(128, 256) | (1, 1) | 0.2464 | 0.3694 | 0.6671 |
| RowMajor(128, 256) | (2, 2) | 0.2471 | 0.2458 | 1.005 |
| RowMajor(128, 256) | (2, 4) | 0.2592 | 0.2511 | 1.032 |
| RowMajor(128, 256) | (4, 4) | 0.2543 | 0.2572 | 0.9889 |
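
The Ratio values are consistent with the tilefusion time divided by the cutlass time, so a value below 1 means the TileFusion implementation was faster for that configuration. A small sketch for recomputing the column from the timings above is shown here. The function name `speed_ratio` is hypothetical, and because the listed times are already rounded, the last digit can differ slightly from the reported ratio.

```python
def speed_ratio(tilefusion_ms, cutlass_ms):
    """Time ratio; values below 1 mean the tilefusion run took less time."""
    return tilefusion_ms / cutlass_ms


# A few rows copied from the results: (shape, warp layout, tilefusion ms, cutlass ms)
rows = [
    ("RowMajor(16, 64)", "(1, 1)", 0.02996, 0.02957),
    ("RowMajor(128, 128)", "(1, 1)", 0.1369, 0.154),
    ("RowMajor(128, 256)", "(1, 1)", 0.2464, 0.3694),
]

for shape, warps, tf_ms, cl_ms in rows:
    print(f"{shape} {warps}: {speed_ratio(tf_ms, cl_ms):.3f}")
```

The two largest gaps in the results both occur with the (1, 1) warp layout on the larger tiles, where the ratio drops to roughly 0.89 and 0.67.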