Thursday, March 31, 2011

Starting Off With CUDA

First Few Steps...
About to kick off with CUDA, and I admit I'm pretty excited! Yesterday, I managed to compile and execute my first CUDA program (I'm using Linux), which was basically taken from this blog.
(A)  CUDA tutorial blog I
This blog is good for beginners. The following posts from it are helpful:
  • My first CUDA program
  • Threads and block and grids, oh my!
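To make the threads/blocks/grids idea concrete, here is a minimal sketch of my own (not code from that blog) showing how each thread typically derives a global index from its block and thread IDs. The names add_vectors and add_on_gpu are hypothetical, both input vectors are assumed to have the same length, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one element. The global index combines the block ID,
// the block size, and the thread ID within the block.
__global__ void add_vectors(const float* a, const float* b, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                       // the last block may have spare threads
        out[i] = a[i] + b[i];
    }
}

// Hypothetical host wrapper: copies the inputs to the GPU, launches the
// kernel, and copies the result back. Assumes a.size() == b.size().
std::vector<float> add_on_gpu(const std::vector<float>& a,
                              const std::vector<float>& b)
{
    int n = static_cast<int>(a.size());
    std::vector<float> out(n);
    if (n == 0) return out;            // a launch with zero blocks is invalid

    size_t bytes = n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_out = NULL;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                           // threads per block
    int blocks = (n + threads - 1) / threads;    // enough blocks to cover n
    add_vectors<<<blocks, threads>>>(d_a, d_b, d_out, n);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_out);
    return out;
}
```

A file like this would normally be saved with a .cu extension and compiled with nvcc.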

(B)  CUDA tutorial blog II
Another good blog. I found the following posts helpful:
  • Inter-thread communication: explains a code snippet for reduction
  • Atomic operations: demonstrates why using atomicAdd() (or any atomicXXX()) is a bad idea
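As a rough illustration of both topics, here is a sketch of a sum reduction. This is my own example, not the snippet from that blog. Each block first reduces its elements in shared memory, and only one thread per block then calls atomicAdd(), which is one common way to limit the contention that atomics cause when every thread hits the same address. The names sum_kernel and sum_on_gpu are hypothetical; the sketch assumes a GPU that supports integer atomics on global memory and ignores integer overflow and error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256   // must be a power of two for the halving loop below

__global__ void sum_kernel(const int* in, int n, int* total)
{
    __shared__ int partial[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Load one element per thread; pad with zero past the end of the input.
    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction: halve the number of active threads at every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            partial[tid] += partial[tid + stride];
        }
        __syncthreads();   // every thread reaches this, active or not
    }

    // One atomic per block instead of one per thread.
    if (tid == 0) {
        atomicAdd(total, partial[0]);
    }
}

// Hypothetical host wrapper that returns the sum of all values.
int sum_on_gpu(const std::vector<int>& values)
{
    int n = static_cast<int>(values.size());
    if (n == 0) return 0;

    int *d_in = NULL, *d_total = NULL;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_total, sizeof(int));
    cudaMemcpy(d_in, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    sum_kernel<<<blocks, BLOCK_SIZE>>>(d_in, n, d_total);

    int total = 0;
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return total;
}
```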


(C)  CUDA online course by Stanford folks (downloadable via svn)


CUDA Helper Libraries
(see the CUDA tools and ecosystem page by NVIDIA)
(A) Thrust Standard Template Library
Thrust is a C++ template library for CUDA. Thrust allows you to program GPUs using an interface similar to the C++ Standard Template Library (STL). These templates are a way to write generic algorithms and data structures. The library is simply a cohesive collection of such algorithms and data structures in a single package, and it acts as a higher-level wrapper around the CUDA API.
Downloads
Documentation
FAQ
Note that Thrust requires CUDA 3.0 or newer.
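To give a feel for the STL-like interface, here is a small sketch that sorts and sums a vector on the GPU. The function name sort_and_sum is hypothetical; the Thrust containers and algorithms it calls (host_vector, device_vector, sort, reduce, copy) are the library's own.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/functional.h>

// Sorts `values` in ascending order on the GPU and returns the sum of its
// elements. The sorted data is copied back into `values`.
int sort_and_sum(thrust::host_vector<int>& values)
{
    // Assigning a host_vector to a device_vector copies the data to the GPU.
    thrust::device_vector<int> d = values;

    // Looks just like std::sort, but runs on the device.
    thrust::sort(d.begin(), d.end());

    // Sum reduction with an initial value of 0.
    int total = thrust::reduce(d.begin(), d.end(), 0, thrust::plus<int>());

    // Copy the sorted data back to the host.
    thrust::copy(d.begin(), d.end(), values.begin());
    return total;
}
```

Notice that there are no explicit cudaMalloc() or cudaMemcpy() calls; the containers take care of allocation and transfers.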

(B) CUDA Utility (CUTIL) 
I couldn't find any documentation for the library anywhere. However, the header file itself (NVIDIA_CUDA_SDK/common/inc/cutil.h) has Doxygen-style comments that tell you exactly what each function does.
CUTIL provides functions for:
  • Parsing command line arguments
  • Reading and writing binary files and PPM-format images
  • Comparing arrays of data (typically used for comparing GPU results with CPU results)
  • Timers
  • Macros for checking error codes
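To show what two of these utilities amount to, here is a hand-rolled sketch of an error-checking macro and an array comparison. These are NOT CUTIL's own macros or functions: CHECK_CUDA, arrays_match, and round_trip_matches are hypothetical names of mine, built only on the standard CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>

// Hypothetical macro (not from cutil.h): aborts with file and line
// information if a CUDA runtime call does not return cudaSuccess.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical helper: true if every element of `result` is within
// `epsilon` of the corresponding element of `reference`.
bool arrays_match(const float* reference, const float* result,
                  int n, float epsilon)
{
    for (int i = 0; i < n; ++i) {
        if (fabsf(reference[i] - result[i]) > epsilon) {
            return false;
        }
    }
    return true;
}

// Copies `host` to the GPU and back, checking every call, and then
// verifies that the data survived the round trip unchanged.
bool round_trip_matches(const float* host, int n)
{
    size_t bytes = n * sizeof(float);
    float* device = NULL;
    float* back = (float*)malloc(bytes);

    CHECK_CUDA(cudaMalloc((void**)&device, bytes));
    CHECK_CUDA(cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(back, device, bytes, cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFree(device));

    bool ok = arrays_match(host, back, n, 0.0f);
    free(back);
    return ok;
}
```

In real code the comparison would usually run against a CPU implementation of the same computation, with a tolerance suited to the floating-point precision involved.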




Debugging CUDA
Check out the debugging and profiling page on NVIDIA's website.
I am thinking of using TotalView, mainly because it has a GUI and lets me debug host and device code simultaneously.


Miscellaneous Issues
CUDA FAQ by NVIDIA
This page describes how to dynamically allocate shared memory.
CUDA implementations of data structures (kd-tree, octree, etc.)
CUDA kd-tree source code (the archive did not extract for me)
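
On the dynamic shared memory allocation mentioned above: as far as I understand it, the kernel declares an unsized extern __shared__ array, and the size in bytes is passed as the third parameter of the launch configuration. Here is a minimal sketch of my own (not taken from the linked page). The names reverse_in_block and reverse_on_gpu are hypothetical, and the sketch only handles inputs small enough to fit in a single block (I assume at most 512 elements).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Reverses an array of n elements using a single block.
__global__ void reverse_in_block(int* data, int n)
{
    // No size here: it is supplied at launch time.
    extern __shared__ int tile[];

    int t = threadIdx.x;
    if (t < n) {
        tile[t] = data[t];
    }
    __syncthreads();          // wait until the whole tile is loaded
    if (t < n) {
        data[t] = tile[n - 1 - t];
    }
}

// Hypothetical host wrapper. Sketch only: inputs that are empty or longer
// than 512 elements are returned unchanged.
std::vector<int> reverse_on_gpu(std::vector<int> values)
{
    int n = static_cast<int>(values.size());
    if (n == 0 || n > 512) return values;

    size_t bytes = n * sizeof(int);
    int* d_data = NULL;
    cudaMalloc((void**)&d_data, bytes);
    cudaMemcpy(d_data, values.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    reverse_in_block<<<1, n, bytes>>>(d_data, n);

    cudaMemcpy(values.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return values;
}
```

This is handy when the shared memory size depends on a value known only at run time, such as the block size or the input length.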