By 0x7df, Tue 21 April 2015, in category Computer science
Recently, I posted a basic introduction to CUDA C for programming GPUs, which showed how to do a vector addition. This illustrated some of the CUDA basic syntax, but it wasn't a complex- enough example to bring to light some of the trickier issues to do with designing algorithms carefully to minimise data movement. Here we move on to the more complicated algorithm for matrix multiplication, C = AB, where we'll see that elements of the matrices get used multiple times, so we'll want to put them in the shared memory to minimise the number of times they get retrieved from the much slower global (or device) memory. We'll also see that, because data that a thread puts into shared memory is only accessible by the other threads in the same thread block, we need to be careful how we do this.
First, let's ignore those concerns and put together the simplest implementation of matrix multiplication; then we'll analyse the memory access, and see how we can improve on it.
Before we begin, however, some-error checking. Below is a function-like
C macro that will be used to surround each CUDA statement we execute
with a check of the return code. The return code is set to the
pre-defined variable cudaSuccess
if the statement executed
successfully, or an error value otherwise. (Hence, we declare the
variable that will contain the CUDA statement return to be type
cudaError_t
.) Where an error value is returned, we pass this to the
CUDA function cudaGetErrorString
, which returns an error message that
we can print.
\#define cudaCheck(stmt)
\\
do \\
{ \\
cudaError_t err = stmt;
\\
if (err != cudaSuccess) \\
{ \\
printf("ERROR: failed to run %s\\n", stmt); \\
printf("ERROR: CUDA error %s\\n", cudaGetErrorString(err)); \\
return -1; \\
} \\
} while (0)
Now for the kernel function. The way we've chosen to divide this problem up amongst threads is to have each thread calculate a single element in the output vector, C. Mathematically, for an m-by-n matrix A and an n-by-p matrix B, this is:
for each of the m-by-p elements in C. This is illustrated in the figure, where the input matrices A and B are shown in grey, and the result, matrix C, in blue; a single element of C is highlighted in red, and the corresponding row and column of A and B are also highlighted.
We implement this in CUDA C as follows:
__global__ void matrixMultiply(float *A, float *B,
float *C, int numACols,
int numBRows, int numBCols,
int numCRows, int numCCols)
{
// Get the row and column indices of the single
// element of the output matrix that this thread
// is dealing with
int col = threadIdx.x + blockDim.x*blockIdx.x;
int row = threadIdx.y + blockDim.y*blockIdx.y;
// Calculate the output matrix element
if ((row < numCRows) && (col < numCCols))
{
float Ctmp = 0;
for (int k = 0; k < numACols; ++k)
{
Ctmp += A[row*numACols+k]*B[k*numBCols+col];
}
C[row*numCCols + col] = Ctmp;
}
}
This is reasonably simple. Each thread figures out which output matrix element it is responsible for, simply by checking the thread indices. It proceeds only if the element indices are within the correct bounds of the output matrix, which may not be the case if there are more threads than elements (because we have to have a whole number of thread blocks). Where they are, it retrieves the correct row of A and column of B, and calculates the corresponding single element of C.
For completeness, here is the host code. The new things here that we didn't see in the vector multiplication example are:
cudaCheck
(defined above) for error checkingThe call to cudaDeviceSynchronize()
:::c int main(int argc, char **argv) {
float hostA, hostB, hostC;
float deviceA, deviceB, deviceC;
int numARows, numACols; // Rows, columns in the matrix A
int numBRows, numBCols; // Rows, columns in the matrix B
int numCRows, numCCols; // Rows, columns in the matrix C
int sizeA, sizeB, sizeC; // Size in memory of each of A, B and C
int gridXSize, gridYSize; // Number of thread blocks in x, y dimensions of grid
int blockSize; // Number of threads in block
// Allocate and populate the A and B matrices
// hostA and hostB, and get numARows, numACols,
// numBRows, numBCols
// Set numCRows and numCCols
numCRows = numARows;
numCCols = numBCols;
// Allocate the C matrix
hostC = (float)malloc(numCRowsnumCColssizeof(float));
// Allocate GPU memory
sizeA = numARowsnumAColssizeof(float);
sizeB = numBRowsnumBColssizeof(float);
sizeC = numCRowsnumCColssizeof(float);
cudaCheck(cudaMalloc((void ) &deviceA, sizeA));
cudaCheck(cudaMalloc((void ) &deviceB, sizeB));
cudaCheck(cudaMalloc((void *) &deviceC, sizeC));
// Copy data to the GPU
cudaCheck(cudaMemcpy(deviceA, hostA, sizeA,
cudaMemcpyHostToDevice));
cudaCheck(cudaMemcpy(deviceB, hostB, sizeB,
cudaMemcpyHostToDevice));
// Initialize the grid and block dimensions
blockSize = 16;
gridXSize = (numCCols-1)/blockSize + 1;
gridYSize = (numCRows-1)/blockSize + 1;
dim3 dimGrid(gridXSize, gridYSize, 1);
dim3 dimBlock(blockSize, blockSize, 1);
// Launch the GPU Kernel
matrixMultiply<<
deviceC, numACols,
numBRows, numBCols,
numCRows, numCCols);
cudaDeviceSynchronize();
// Copy the GPU memory back to the CPU
cudaCheck(cudaMemcpy(hostC, deviceC, sizeC,
cudaMemcpyDeviceToHost));
// Free the GPU memory
cudaCheck(cudaFree(deviceA));
cudaCheck(cudaFree(deviceB));
cudaCheck(cudaFree(deviceC));
// Do something with the solution, free the host memory, return
}
The call to cudaDeviceSynchronize()
ensures that all threads have
finished before the host code proceeds any further.
Clearly, each of the \(mp\) elements of \(C\) requires a full row of \(A\) and a full column of \(B\) - both of length \(n\) - to be read from memory, and one value to be written back. Hence there are \((2n + 1)mp\) memory accesses. Re-examining the kernel, we see that there are two floating point operations per iteration of the inner loop (one multiply and one add), and \(n\) iterations of that loop, which is completed for each of the \(mp\) elements in the product matrix. Hence, there are \(2nmp\) FLOP, and the CGMA is \(2n/(2n + 1)\); which is effectively 1, except when the matrices are very small. With a memory bandwidth of 150 GB/s, the algorithm is limited to just under 150/8 = 20 GFLOP/s (assuming double precision), which is still less than 2% of the available compute of our nominal 1 TFLOP GPU.
However, it turns out that we can improve on this. So far, all the data storage has been in global memory, because that's the only permissible location for CUDA memory allocations in the host code, and that's where the data stays unless we explicitly move it, once inside the kernel function (we'll see how later). It's also clear that in this algorithm data gets re-used frequently. Every row of matrix \(A\) is used \(p\) times and every column of matrix \(B\) is used \(m\) times. If we contrive an algorithm that gets the necessary data into shared memory before it is needed, and keeps it there while it is being re-used, then we can clearly reduce the global memory accesses.
However, it's not as though we can read \(A\) and \(B\) into shared memory and have them accessible to all the threads working on the computation; shared memory isn't globally accessible, despite the name, but is instead local to a single streaming multiprocessor, and only 'shared' amongst the threads in whichever thread block is currently assigned to the SM. Hence our goal is to ensure that the threads in a given thread block have the subset of input data they need available in their SM's shared memory, under the general assumption that because of the small size of the shared memory, not all of the needed data will fit in at once.
Consider a thread block covering an area of the product matrix \(C\), which is \(a\) rows high by \(a\) columns wide, with the top-left element being \(i\), \(j\) and the bottom-right therefore being \(i+a, j+a\). This is shown in the figure. To compute these values, the rows \(i, i+1, ..., i+a\) of matrix \(A\) and columns \(j, j+1, ..., j+a\) of matrix \(B\) are required, comprising horizontal and vertical strips, respectively, of dimension \(a \times n\) elements. We assume in general these strips comprise too much data to move all together to shared memory. Instead, we move a block of elements from the strip of \(A\), and a block of elements from the strip of \(B\) - i.e. two blocks of size \(a \times a\), one from each matrix; we will refer to these as tiles. Performing matrix multiplication on these two tiles creates a tile of partial sums in the \(C\) elements. When the next pair of tiles from \(A\) and \(B\) are retrieved, the partial sums are further incremented, until eventually the full strips have been processed and the final answers are available.
There is still some duplication of global memory accesses, because any given strip of \(A\) will be required by all the thread blocks of the \(C\) matrix that share the same row indices; and any given strip of \(B\) will be required by all the thread blocks of the \(C\) matrix that share the same column indices. However, we can see that there is at least some re-use of data in shared memory; each sub-row of the tile from \(A\) gets re-used \(a\) times (for the \(a\) elements of the output matrix that have the same row index), as does each sub-column of the tile from \(B\). This data re-use reduces the retrievals from global memory by a factor of \(a\).
Here is the kernel for tiled matrix multiplication.
__global__ void matrixMultiply(float *A, float *B, float *C,
int numARows, int numACols,
int numBRows, int numBCols,
int numCRows, int numCCols) {
// Define device shared-memory storage for
// tiles of the matrices
// Scope: each tile is accessible by a single
// block of threads
__shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
__shared__ float tileB[TILE_WIDTH][TILE_WIDTH];
// Define abbreviated variables for the
// block and thread IDs
// Scope: stored in registers and therefore
// accessible by single threads
int bx = blockIdx.x;
int by = blockIdx.y;
int tx = threadIdx.x;
int ty = threadIdx.y;
// Each thread is responsible for a single
// element of the product matrix C.
// Determine which element, from the block
// and thread indices
int row = by*TILE_WIDTH + ty;
int col = bx*TILE_WIDTH + tx;
// Initialise a temp variable for the solution
// for this matrix element
// Scope: in register, private to individual thread
float Ctemp = 0;
// Loop over the tiles in the A and B matrices
// that will contribute to the calculation of
// this element in the product matrix. We are
// looping over columns of A for a given row
// (equal to the row index of the C element),
// and over rows of the B matrix for a given
// column index (equal to the column index of
// the C element)
int numTiles = (numACols-1)/TILE_WIDTH + 1;
for (int tl = 0; tl < numTiles; ++tl) {
// Load the tiles into shared memory, so all
// threads in the block have access to the
// whole tiles. Each thread needs to load only
// a single value of each of the A and B tiles.
if ((row < numARows) && (tl*TILE_WIDTH + tx < numACols)) {
tileA[ty][tx] = A[row*numACols + tl*TILE_WIDTH + tx];
} else {
tileA[ty][tx] = 0.;
}
if ((tl*TILE_WIDTH + ty < numBRows) && (col < numBCols)) {
tileB[ty][tx] = B[(tl*TILE_WIDTH + ty)*numBCols + col];
} else {
tileB[ty][tx] = 0.;
}
__syncthreads();
// Loop over the elements within the A and B
// tiles that contribute to this element of C
for (int k = 0; k < TILE_WIDTH; ++k) {
Ctemp += tileA[ty][k] * tileB[k][tx];
}
__syncthreads();
}
// Write the final value into the output array
if ((row < numARows) && (col < numBCols)) {
C[row*numBCols + col] = Ctemp;
}
}
In each thread block, the \(a^2\) threads load two float values each and perform \(2a\) floating-point operations to compute the dot product of the row and column sub-sections (both of length \(a\)) required for the single output matrix element it holds. Hence there are \(2a\) computations for two memory loads, which gives a CGMA ratio of \(a\). For the naive implementation it was 1, so we have improved the CGMA by a factor of \(a\) by tiling the data.
There are a few other things to note in the kernel.
__shared__
identifier in the allocations statements
for tileA
and tileB
(which are the temporary storage arrays for
the tiles of \(A\) and \(B\)). This keyword is how we cause the storage
to be allocated in shared memory (and therefore it can be used only
in __device__
functions, not __host__
functions).TILE_WIDTH
is a C macro that we assume has been defined elsewhere.Calculation of the \(C\) element indices row
and col
is done using
TILE_WIDTH
, where previously blockDim.x
and blockDim.y
appeared. This works because we have defined the tile to be the
same size as the thread block. In theory it could be different, but
doing so gives us the very convenient consequence that each thread
needs only to load a single element from each of \(A\) and \(B\) into
shared memory to construct the tiles. This means the host code that
calls the kernel needs to use TILE_WIDTH
to define the block size:
gridXSize = (numCCols-1)/TILE_WIDTH + 1;
gridYSize = (numCRows-1)/TILE_WIDTH + 1;
dim3 DimGrid(gridXSize, gridYSize, 1); // gridSize blocks in the
grid
dim3 DimBlock(TILE_WIDTH, TILE_WIDTH, 1); // blockSize threads in
each block
matrixMultiply<<<DimGrid,DimBlock>>>(deviceA, deviceB,
deviceC, ...
We have put some logic around the statements that transfer data to
the shared-memory tile storage. Since we can't guarantee that there
will be a whole number of thread blocks in the matrix, this prevents
threads whose row
, col
indices are outside the bounds of either
A or B from attempting to retrieve data that isn't there.
__syncthreads()
. This is a barrier
synchronization across all threads that ensures all threads complete
any work up to this point before any proceed further. Without this,
some threads could move on to begin computing matrix elements before
other threads have loaded the correct data into shared memory, and
out-of-date data could be used.