Multi-block parallel reduction for a noncommutative operator
The multi-block approach to parallel reduction is very similar to the single-block approach. The global input array must be split into pieces, each of which is reduced by a single block. Once the partial result of every block has been obtained, one final block reduces the partials to produce the final result. Because the operator is noncommutative, each block must reduce a contiguous, in-order slice of the input, and the partial results must be combined in block order.
- sumNoncommSingleBlock is explained in more detail in the single-block reduction example.
- lastBlock accepts only the last block that reaches it. If you want to avoid this gating, you can split the kernel into two separate launches (see the sketch after the listing below).
static const int wholeArraySize = 100000000;
static const int blockSize = 1024;
static const int gridSize = 24; //this number is hardware-dependent; usually #SM*2 is a good number.
__device__ bool lastBlock(int* counter) {
    __threadfence(); //ensure that the partial result of this block is visible to all blocks
    int last = 0;
    if (threadIdx.x == 0)
        last = atomicAdd(counter, 1); //each block takes a ticket; the last ticket is gridDim.x-1
    return __syncthreads_or(last == gridDim.x-1); //true in every thread of the last-arriving block
}
__device__ void sumNoncommSingleBlock(const int* gArr, int arraySize, int* out) {
    int thIdx = threadIdx.x;
    __shared__ int shArr[blockSize*2]; //second half buffers the next chunk of input
    __shared__ int offset;
    shArr[thIdx] = thIdx<arraySize ? gArr[thIdx] : 0;
    if (thIdx == 0)
        offset = blockSize;
    __syncthreads();
    while (offset < arraySize) { //uniform; all threads see the same offset
        //fold the next blockSize elements into the running partial sums
        shArr[thIdx + blockSize] = thIdx+offset<arraySize ? gArr[thIdx+offset] : 0;
        __syncthreads();
        if (thIdx == 0)
            offset += blockSize;
        int sum = shArr[2*thIdx] + shArr[2*thIdx+1]; //merge adjacent slots, preserving operand order
        __syncthreads();
        shArr[thIdx] = sum;
    }
    __syncthreads();
    //tree reduction of the remaining blockSize partial sums
    for (int stride = 1; stride<blockSize; stride*=2) { //uniform
        int arrIdx = thIdx*stride*2;
        if (arrIdx+stride<blockSize)
            shArr[arrIdx] += shArr[arrIdx+stride];
        __syncthreads();
    }
    if (thIdx == 0)
        *out = shArr[0];
}
__global__ void sumNoncommMultiBlock(const int* gArr, int* out, int* lastBlockCounter) {
    int arraySizePerBlock = wholeArraySize/gridSize;
    const int* gArrForBlock = gArr+blockIdx.x*arraySizePerBlock;
    int arraySize = arraySizePerBlock;
    if (blockIdx.x == gridSize-1)
        arraySize = wholeArraySize - blockIdx.x*arraySizePerBlock; //the last block also takes the remainder
    sumNoncommSingleBlock(gArrForBlock, arraySize, &out[blockIdx.x]); //one partial sum per block
    if (lastBlock(lastBlockCounter))
        sumNoncommSingleBlock(out, gridSize, out); //the last-arriving block reduces the partials
}
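As noted above, the lastBlock gating can be avoided entirely by splitting the work into two separate launches: kernel launches issued to the same stream execute in order, so no counter or fence is needed between the two passes. A minimal sketch of that variant follows; the kernel names sumNoncommFirstPass and sumNoncommFinalPass (and the dev_arr/dev_out buffers) are illustrative, not part of the original code.

__global__ void sumNoncommFirstPass(const int* gArr, int* out) {
    int arraySizePerBlock = wholeArraySize/gridSize;
    const int* gArrForBlock = gArr + blockIdx.x*arraySizePerBlock;
    int arraySize = arraySizePerBlock;
    if (blockIdx.x == gridSize-1)
        arraySize = wholeArraySize - blockIdx.x*arraySizePerBlock;
    sumNoncommSingleBlock(gArrForBlock, arraySize, &out[blockIdx.x]); //one partial per block
}
__global__ void sumNoncommFinalPass(int* out) {
    sumNoncommSingleBlock(out, gridSize, out); //reduce the gridSize partials in a single block
}
//host side, on the same stream (ordering between the launches is implicit):
//sumNoncommFirstPass<<<gridSize, blockSize>>>(dev_arr, dev_out);
//sumNoncommFinalPass<<<1, blockSize>>>(dev_out);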
Ideally, one would want to launch enough blocks to saturate all multiprocessors on the GPU at full occupancy. Exceeding that number, in particular launching as many threads as there are elements in the array, is counterproductive: doing so does not increase the raw computational power any further, but it does prevent the use of the very efficient first loop.
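For completeness, here is a minimal host-side sketch of how the single-kernel version above might be invoked. The buffer names are illustrative; note that the counter must be zeroed before the launch. In this listing gridSize is a compile-time constant, but at run time a comparable hardware-dependent value could be derived from cudaGetDeviceProperties (prop.multiProcessorCount).

int main() {
    int* arr = new int[wholeArraySize];
    //... fill arr with input data ...
    int* dev_arr;
    cudaMalloc((void**)&dev_arr, wholeArraySize * sizeof(int));
    cudaMemcpy(dev_arr, arr, wholeArraySize * sizeof(int), cudaMemcpyHostToDevice);
    int* dev_out;
    cudaMalloc((void**)&dev_out, gridSize * sizeof(int)); //one partial sum per block
    int* dev_lastBlockCounter;
    cudaMalloc((void**)&dev_lastBlockCounter, sizeof(int));
    cudaMemset(dev_lastBlockCounter, 0, sizeof(int)); //the counter must start at 0
    sumNoncommMultiBlock<<<gridSize, blockSize>>>(dev_arr, dev_out, dev_lastBlockCounter);
    int sum;
    cudaMemcpy(&sum, dev_out, sizeof(int), cudaMemcpyDeviceToHost); //final result is in dev_out[0]
    cudaFree(dev_lastBlockCounter);
    cudaFree(dev_out);
    cudaFree(dev_arr);
    delete[] arr;
}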