Tuesday, March 5, 2013

Intuitive understanding of the parallel SCAN algorithm

Due to lack of time I won't describe what Scan is and I also won't describe the algorithm separately. You can find a very helpful and clear description of both of them here. What we'll try to do is to try and understand the algorithm intuitively. And while doing so we will also see how the algorithm proceeds.


Let's say we have 16 elements in the original input vector.Now, there are different passes of the algorithm. Each k'th pass is a transformation in itself, meaning it will accept an 16-vector from the previous pass and produce a transformed (i.e. with potentially different values as elements) 16-vector as output. Every k'th pass is associated with an offset value, such that to each vec[i], the pass adds vec[i-offset]. The offset for k'th (k starting from 0) pass is 2^k. So first pass adds vec[i-1], second pass adds vec[i-2], third pass adds vec[i-4]. The algorithm claims that after log(n) number of passes, the prefix sum is ready. The question is: How?

In the first pass, every i'th element will add to itself the element immediately before it.  So, the penetration of every element is -1, meaning we have so far summed 1 element to the left of vec[i]. Now, when we do the second pass, every vec[i] will add vec[i-2] to itself. Now what's the resulting penetration of vec[i]? -2? No! It's -3. As vec[i-2] already has penetrated -1 relative to itself, every vec[i]'s effective penetration becomes -3. So, every ith element now stores the sum of 3 previous elements (and itself ofcourse). Now, in the third pass, the offset is 4, so we add vec[i-4] to vec[i] for each i, and as vec[i-4] already had a penetration of -3 so far, vec[i]'s resultant penetration becomes -7. I guess you have an idea of how the penetrations are getting cumulatively stretched roughly exponentially. Now, in the last (fourth) pass, the offset is 8, so for each i, we add vec[i-8] to vec[i]. And as vec[i-8]'s penetration was already -7, vec[i]'s effective penetration becomes -8-7 = -15. For a 16-element vector, -15 is the maximum penetration needed for obtaining the prefix sum. So, the algorithm terminates. 

I hope that helps.

Friday, March 1, 2013

Running your first CUDA program on Windows 8

Today I bought a new laptop with NVIDIA GeForce GT 660M card, and it came with pre-installed Windows 8. I performed the following steps, and have successfully executed my first small CUDA code.

1. Installed MS Visual Studio (Visual C++) Express 2010 Edition
2. Installed most recent CUDA Toolkit (5.0) from here.
3. Executed the NVidia CUDA samples that came along with the toolkit (as a part of the computing SDK I guess). They executed correctly. However I was not able to compile the corresponding VC projects in VS 2010.
4. So, I followed these steps (and as a part of that wrote a simple and small CUDA code for vector addition within a new VS project). Now my new CUDA code compiled successfully.
5. To allow syntax highlighting for CUDA (.cu) files in VS, I followed these steps. And now, the syntax highlighting (even for CUDA keywords such as BlockIdx) is working fine! :).

Beginning with Parallel Nsight

To download Parallel Nsight Visual Studio edition you need to register on NVidia's developers forum for Parallel Nsight developer profile (after you register for the basic profile first). Then they approve you and you get to download the software. Installation of the software requires Visual Studio 2010 Professional, meaning it does not work with Express editions.

The debug begins

After installing that, then debugging my first program was very straightforward due to the seamless integration of Nsight with Visual Studio. I just set breakpoints at a couple of places within my kernel snippet and then instead of initiating regular VS type debugging, I clicked Nsight->Start CUDA Debugging. However, note that Start CUDA Debugging does not compile the code. You will have to build the project first before clicking Start CUDA Debugging so that your recent changes will be reflected.