Ray Tracing Diary: CUDA

Showing posts with label CUDA. Show all posts

Tuesday, April 12, 2011

CUDA Kd-tree Implementations

My motivation of going after CUDA kd-tree implementations is to be able to do the k-Nearest-Neighbors (a.k.a. KNN queries) queries in Delaunay triangulation (DT) procedure. The incremental algorithm of DT requires us, to find the a point with nearest delaunay distance to the given edge. The delaunay distance of a point to the line is not equal to the Euclidean distance between the two, but the points that are Euclidean-distance-wise closer to the edge, are more probably of being closer delaunay-distance-wise. Without any optimization or acceleration structure, finding the delaunay-closest point would require us to check every point in the input point-cloud to find the minimum. So, if we could somehow conservatively limit ourselves to a subset of all n points, that would speed-up the algorithm significantly. Ergo, I thought of kd-tree. If we could do a prepass where we take all n points and organize them in a kd-tree, finding such a subset of points is equivalent to doing KNN queries. There is still one unanswered question in my mind, why isn't the euclidean-nearest point to an edge equal to the delaunay-nearest point to the edge? Or is it? If yes, then why don't we just find the closest find (the nearest neighbor), why to do all the circumcircle computations? There should be some reason why we do so, that makes the euclidean-closest point not necessarily the delaunay-closest point to a given edge.
Well, following are, in my opinion, good resources for kd-tree on CUDA:

[0] http://www.nvidia.com/content/nvision2008/tech_presentations/Game_Developer_Track/NVISION08-Interactive_Ray_Tracing.pdf

[1] MNRT (here)
The "Source Code Availability" section of the documentation of MNRT says that "Source code of MNRT will be available soon. I still need to update some Doxygen comments to be ready for this step". So no source code for now.
so, MNRT rating: 1/5 - mainly because lack of source code ;)

[2] <<TODO>> http://code.google.com/p/cuda-kdtree/source/checkout
I could download and compile the code successfully. The code is meant for Windows and uses some Windows-specific calls for setting up the timers, which can be simply commented out to make the code run on Linux (I use Ubuntu Maverick 10.10). The code takes an .obj model file, reads its vertices, and constructs a kd-tree with those points on GPU. It writes the kd-tree blocks into an output file (named "kdtree.will").
I suspect that the code is an implementation of [0], simply because the one authors of the code and that of the paper is named Rui. But that's just a guess. If it's so, the paper can help understand the code more.
The only problem I found with the code is that it does not provide the nearest-neighbor queries to the kd-tree, yet. It just constructs it. The next code I will talk about even provides such functions.
so, rating: 3/5 - good code but no nearest neighbor queries yet.

[3] <<TODO>> Nghia Ho's blog post on KD-tree on GPU for 3D point searching discusses his implementation and results. His implementation also includes the nearest neighbor searching.

so, rating: ??/5 - good code but no nearest neighbor queries yet.

[4] <<TODO>> Shawn Brown's implementation - http://www.cs.unc.edu/~shawndb/
* is 2010 work.
* has source-code.
* support for 2D, 3D, 4D point data.
* has GPU kernels for Query nearest neighbor (QNN), All-Nearest Neighbor (All-NN), 'k' nearest neighbor (kNN), and All 'k' Nearest Neighbor (All-kNN).
* has documentation- has corresponding 2010 paper by himself.
so, rating: ??/5 - all above reasons.

[5] Zidan's implementation: http://www.mahmoudzidan.com/WorkArchive/
* Not tested yet!

[6] Kd tree on GPU for 3D point searching: http://nghiaho.com/?p=437

Monday, April 4, 2011

Delaunay Triangulation using CUDA

I have planned to implement Delaunay triangulation in CUDA given a set of points.

Delaunay triangulation of a given set of points in d-dimensional Euclidean space is accurately defined in [2]. They define it as following:

The Delaunay Triangulation (DT(P)) of the Ed space deﬁned on a pointset P is the set of

d-simplices such that:
1. a point p in Ed is vertex of a simplex in DT(P) iﬀ p ∈ P ;
2. the intersection of two simplices in DT(P) is either an empty set or a common face;
3. the hyper-sphere circumscribed around the d + 1 vertices of each simplex contains no other point
of the set P.

Here, by Ed we mean d-dimensional Euclidean space. It should be noted that, the Delaunay triangulation of d-dimensional points gives d-simplices, meaning that, that of 3D points will give tetrahedrons, and that of 2D points would give triangles.

Initially I intended to do it for 3D point clouds, so that I can triangulate the point cloud that I get from Kinect. (With Kinect's point clouds, we will most probably be additionally bugged by the noise in the depth information). However, that may be too much of a task for too small time. So, I have unhappily decided to do it just for a set of 2D points. So that, probably I would need to worry less about the algorithm, and probably give more time for learning CUDA, which is basically the formal intention of the project.

The duality between Delaunay triangulations and Voronoi diagrams is well known and thus algorithms are given for the construction of the former from the latter. However, it is generally more efficient to directly construct the triangulation, and in fact the construction time for a Delaunay triangulation from a Voronoi diagram is O(n). For these reasons I am deliberately not looking into the algorithms that construct the triangulation for Voronoi diagram.

So after spending some time surfing, I have shortlisted following papers:

1) A Quasi-Parallel GPU-Based Algorithm for Delaunay Edge-Flips (Aug. 2010)
The algorithm is a flip algorithm[0]. The input to the algorithm is some triangulation (absolutely arbitrary??) of the points, and the algorithm performs some edge flippings[1], and ends up making it the Delaunay triangulation. (Delaunay triangulation for a given set of points is unique if the points are in so-called "general position").

The paper specifically talks about CUDA, which makes it a nice candidate in my list. The only thing I need to think about is, how do I get (even if arbitrary) valid triangulation to start with? Triangulation by definition is

2) An Improved Parallel Algorithm For Delaunay Triangulation on Distributed Memory Parallel Computers (1997)
- A divide-and-conquer merge-sort-type algorithm. It divides the problem into smaller subproblems and then the final delaunay triangulation is obtained by merging the sub-triangulations. Merging typically is the crux of the algorithm.

3) Parallel 3D Delaunay Triangulation (found here)
- A kind of mini-compendium for parallel Delaunay triangulation algorithms. Explains how to parallelize two classes of the algorithms doing the job, divide-and-conquer ("Delaunay Wall" algorithm) and incremental construction method.

4) Computing Two-dimensional Delaunay Triangulation Using Graphics Hardware (here)
- The paper utilizes the typical graphics pipeline for doing the job, also uses CUDA for some part of the algorithm and uses CPU for some part.

Discussion
"As you adequately put, the problem is choice." - Architect to Neo (from "Matrix Reloaded")

Which algorithm to choose finally? Keeping CUDA in mind, I ideally want an algorithm that can be immediately (preferably logically) broken down into essentially hundreds of sub-problems, such that the processing is as much SIMD like as possible in small pieces, and the merging of the subsolutions should not be heavy; or at least be parallelizable. I am thinking of going for (2).

[0] http://en.wikipedia.org/wiki/Delaunay_triangulation#Flip_algorithms

[1] http://en.wikipedia.org/wiki/Delaunay_triangulation#Visual_Delaunay definition:Flipping
[2] http://www.mendeley.com/research/parallel-3d-delaunay-triangulation/

Thursday, March 31, 2011

Starting Off With CUDA

First Few Steps...
About to kick off with CUDA, and I admit I'm pretty excited! Yesterday, I managed to compile and execute my first CUDA program (I'm using Linux), which was basically taken from this blog.
(A) CUDA tutorial blog I (here)
This blog is good for beginners. From the same blog following posts are helpful:
My first CUDA program
Threads and block and grids, oh my!

(B) CUDA tutorial blog II (here)
Another good blog. Following posts are found helpful:
Inter-thread communication: explains a code snippet for reduction
Atomic operations: demonstrates why using atomicAdd() (or any AtomicXXX() is a bad idea)

(C) CUDA Online Course (svn downloadable) by Stanford folks
here

CUDA Helper Libraries
(see the CUDA tools and ecosystem page by NVIDIA)
(A) Thrust Standard Template Library
Thrust is a C++ template library for CUDA. Thrust allows you to program GPUs using an interface similar the C++ Standard Template Library (STL). These templates are a way to write generic algorithms and data structures. The template library is simply a cohesive collection of such algorithms and data structures in a single package, acts as a wrapper around CUDA API.
Downloads
Documentation
FAQ
Note that Thrust requires CUDA 3.0 or newer.

(B) CUDA Utility (CUTIL)
I Couldn't find any documentation for the library anywhere. However the header file itself (NVIDIA_CUDA_SDK/common/inc/cutil.h) has doxygen-style comments and will tell you exactly what they do.
CUTIL provides functions for:

Parsing command line arguments
Read and writing binary files and PPM format images
Comparing arrays of data (typically used for comparing GPU results with CPU)
Timers
Macros for checking error codes

Debugging CUDA

Check out the debugging and profiling page on NVIDIA's website.
I am thinking of using totalview mainly because its has GUI, lets me debug both host and device code simultaneously.

Miscellaneous Issues

CUDA FAQ by NVIDIA

This page describes how to dynamically allocate shared memory.
CUDA Data structures (Kdtree, Octree, etc.) implementations
CUDA Kd-tree source-code (which did not extract x-()