# How do I calculate Block Numbers

I am writing CUDA code and I am using a GeForce 9500 GT graphics card.

I am trying to process an array of 20000000 integer elements, and the number of threads per block I am using is 256.

The warp size is 32. The compute capability is 1.1

This is the hardware http://www.geforce.com/hardware/desktop-gpus/geforce-9500-gt/specifications

Now, block number = 20000000 / 256 = 78125?

This sounds incorrect. How do I calculate the block number? Any help would be appreciated.

My CUDA kernel function is as follows. The idea is that each block calculates its own sum, and the final sum is then computed by adding up the per-block sums.

```
__global__ static void calculateSum(int *num, int *result, int DATA_SIZE)
{
    extern __shared__ int shared[];
    const int tid = threadIdx.x;
    const int bid = blockIdx.x;

    shared[tid] = 0;
    for (int i = bid * THREAD_NUM + tid; i < DATA_SIZE; i += BLOCK_NUM * THREAD_NUM) {
        shared[tid] += num[i];
    }
    __syncthreads();  // all partial sums must be written before the reduction

    int offset = THREAD_NUM / 2;
    while (offset > 0) {
        if (tid < offset) {
            shared[tid] += shared[tid + offset];
        }
        __syncthreads();  // finish each reduction step before halving offset
        offset >>= 1;
    }

    if (tid == 0) {
        result[bid] = shared[0];
    }
}
```

And I call this function as

```
calculateSum<<<BLOCK_NUM, THREAD_NUM, THREAD_NUM * sizeof(int)>>>(gpuarray, result, size);
```

where THREAD_NUM = 256 and gpuarray has 20000000 elements.

Here I am just using 16 as the block number, but I am not sure whether that is right. How can I make sure that maximum parallelism is achieved?

The output of my CUDA occupancy calculator says I will have 100% occupancy when the block number is 8. So does that mean I will have maximum efficiency with block number = 8 and thread number = 256? Is that correct?

Thanks

If each thread processes one element and each block has 256 threads, you need to run 20000000 threads, resulting in exactly 78125 blocks. That is a perfectly valid number.

However, there is a little problem. I don't have a CC 1.1 device at hand, but on CC 1.3:

```
Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
```

So you should either run the kernel several times for different parts of the data, or make a 2D grid and trivially transform the 2D thread address into a 1D array-element address.
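The multiple-launch option could be sketched as host code like this (`processChunk` is a hypothetical kernel that handles `n` elements starting at the given pointer, not something from the question; the chunking scheme itself is my assumption):

```cuda
// Sketch: split the work so each launch needs at most 65535 blocks,
// the per-dimension grid limit on CC 1.x hardware. Untested.
const int THREAD_NUM = 256;
const int MAX_BLOCKS = 65535;                    // grid x-dimension limit
const int CHUNK      = MAX_BLOCKS * THREAD_NUM;  // elements one launch can cover

for (int offset = 0; offset < DATA_SIZE; offset += CHUNK) {
    int n = DATA_SIZE - offset;
    if (n > CHUNK) n = CHUNK;
    int blocks = (n + THREAD_NUM - 1) / THREAD_NUM;  // round up
    // each launch works on its own slice of the input
    processChunk<<<blocks, THREAD_NUM>>>(gpuarray + offset, result, n);
}
```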

The kernel code you have posted can process any input data size, independent of the number of blocks you choose to launch. The choice should be simply down to performance.

As a rule of thumb, for this sort of kernel you want as many blocks as will run concurrently on a single multiprocessor, times the number of multiprocessors on the card. The first number can be obtained from the CUDA occupancy spreadsheet that ships with the CUDA toolkit; its upper limit is 8 blocks per multiprocessor, and the second number is 4 for the device you have. This means no more than 32 blocks will be necessary to achieve the maximum possible parallelism, but answering exactly requires access to a compiler, which I don't currently have.

You can also use benchmarking to determine the optimal block count experimentally, trying 4, 8, 12, 16, 20, 24, 28, or 32 blocks (multiples of 4, because that is the number of multiprocessors on your card).
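A minimal timing harness for that experiment might look like the sketch below (the kernel name and arguments follow the question; the event-timing pattern is standard CUDA runtime API, but this is untested here, and it assumes the kernel's stride matches the launch, e.g. that `BLOCK_NUM` is set to `blocks` or the kernel reads `gridDim.x`):

```cuda
// Benchmark candidate block counts with CUDA events (sketch, not verified
// on real hardware). Assumes gpuarray, result and size are already set up.
float best = 1e30f;
int bestBlocks = 0;
for (int blocks = 4; blocks <= 32; blocks += 4) {  // multiples of 4 SMs
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    calculateSum<<<blocks, THREAD_NUM, THREAD_NUM * sizeof(int)>>>(gpuarray, result, size);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best) { best = ms; bestBlocks = blocks; }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```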

In your case, the total number of threads (20000000) divides evenly by the number of threads per block (256), so you can use that number (78125). If the numbers do not divide evenly, a plain integer division rounds down and you end up with fewer threads than you need. In that case, round the result of the division up with a function like the following:

```
int DivUp(int a, int b)
{
    return ((a % b) != 0) ? (a / b + 1) : (a / b);
}
```

Since this function may give you more threads than there are elements, you then also need to add a test in your kernel to abort calculations on the last few threads:

```
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n_items) {
    return;
}
```
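Putting the two pieces together, a one-thread-per-element launch could be sketched like this (`addOne` is a made-up example kernel, not from the question; the rounding is the same as `DivUp` above):

```cuda
// Hypothetical one-thread-per-element kernel with the bounds check from above.
__global__ void addOne(int *data, int n_items)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_items) return;  // abort surplus threads in the last block
    data[i] += 1;
}

// Host side: round the block count up so every element gets a thread.
int threads = 256;
int blocks  = (n_items + threads - 1) / threads;  // same as DivUp(n_items, threads)
addOne<<<blocks, threads>>>(d_data, n_items);
```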

However, there's an additional snag. Your hardware is limited to a maximum of 65535 blocks in each dimension in a grid, and limited to two dimensions (x and y). So if, after having used DivUp(), you end up with a count that is higher than that, you have two options. You can either split the workload up and run the kernel multiple times or you can use two dimensions.

To use two dimensions, you select two numbers that each are lower than the hardware limit and that, when multiplied, become the actual block count that you require. You then add code in the top of your kernel to combine the two dimensions (x and y) into a single index.

You are only using the x-dimension of the grid in your kernel, so you are limited to 65535 blocks on CC 1.1.

20000000/256 = 78125 is correct!

So you definitely need more than one block.

Kernel:

```
// get unique block index
const unsigned int blockId = blockIdx.x              // 1D
                           + blockIdx.y * gridDim.x; // 2D

// terminate unnecessary blocks (valid block ids are 0 .. 78124)
if (blockId >= 78125)
    return;

// ... rest of kernel
```

The simplest approach would be to use two blocks in the y-dimension and check the block ID in the kernel.

```
dim3 gridDim = dim3(65535, 2);
```

This would make more than 52945 blocks useless. I don't know what the overhead is, but filling up first the x, then the y and z dimensions can create very many unused blocks, especially when reaching the z dimension!

(Nvidia should definitely have provided a utility function for computing the best grid shape for a unique block index inside the kernel, as is the case here.)

For this simple example, how about using both the x and y dimensions and taking the square root?

```
grid(280, 280) = 78400 blocks // only 275 blocks overhead, less is not possible
```

That's one of the great advantages of compute capability 3.0: 32-bit ranges on the grid dimensions often make life easier. Why it was limited to 65535 I never understood.

But I still prefer downward compatibility.

I would also test @talonmies' variation.