Saturday, April 30, 2011

Parallel Convex Hull CUDA

1) gHull -
- They haven't released the source code yet.

- They specifically focus on CUDA.

3) The Graham Scan algorithm for convex hulls also seems quite popular for parallelizing
- Chris Harrison's page on the same algorithm is helpful; it also has source code for a sequential version.

4) Mathematica's page on CUDA convex hulls

5) Optimal Multi-Core Convex Hull

6) Doing QuickHull on GPU - NVIDIA research summit poster here
- That's a research paper.
- Source Code: No idea

Among the above algorithms, I think that QuickHull is the most amenable to parallelization.
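To see why: QuickHull's recursion splits each edge's work into two completely independent sub-problems with no merge step, which is exactly what a GPU wants. A minimal sequential 2D sketch (my own illustration, not code from any of the projects above):

```python
def cross(o, a, b):
    # twice the signed area of triangle (o, a, b); > 0 if b is left of o->a
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _hull_side(pts, a, b):
    # points strictly to the left of the directed edge a->b
    left = [p for p in pts if cross(a, b, p) > 0]
    if not left:
        return []
    # the farthest point from the edge is on the hull; the two recursive
    # calls below are fully independent -- this is the parallel opportunity
    far = max(left, key=lambda p: cross(a, b, p))
    return _hull_side(left, a, far) + [far] + _hull_side(left, far, b)

def quickhull(pts):
    pts = sorted(set(map(tuple, pts)))
    if len(pts) < 3:
        return pts
    a, b = pts[0], pts[-1]  # two extreme points are always on the hull
    return [a] + _hull_side(pts, a, b) + [b] + _hull_side(pts, b, a)
```

A GPU version would replace the recursion with an explicit work-list of (edge, point-subset) tasks.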

Saturday, April 23, 2011

Digging even more with ICP

The options that come with the CICP class are crucial
- icp.options.smallestThresholdDist (playing with this creates horribly bad clouds)
- icp.options.thresholdDist
- icp.options.thresholdAng

- icp.options.ICP_algorithm (only 'icpClassic' is implemented for ICP-3D)
- icp.options.maxIterations (its default may create a problem; better to explicitly set it very high)

Why is ICP not converging?
How do we control the internal RANSAC within ICP (can we provide known correspondences to it)?
Why does ICP return a pdf from which we then take the mean? Isn't ICP supposed to compute just one R,T?

Friday, April 22, 2011

Digging more with ICP

This post is a continuation of my last post, which talks about using ICP to register two coarsely registered point clouds. That post talked about getting the test.cpp code that comes with the MRPT library to perform ICP on two point-clouds. That code synthetically creates two small point-clouds, introduces a small random difference in their poses, and then calls CICP::Align3D() to register the point-clouds.

My plan is to use the ICP API for kinect-generated point-clouds, which are naturally much bigger and noisier than the synthetic point-clouds the original test.cpp generates. 

Until now I did the following tests:
case 1) A single kinect-generated point-cloud was fed in as two point-clouds, the difference being the same small random pose difference that the original example introduces. This worked great; the point-clouds were aligned nicely. Note that the same point-cloud was fed in twice (with a small pose difference), so there was 100% overlap between the two point-clouds.

case 2) So then I fed in two point-clouds with around 20% overlap, but already almost registered (done manually in MeshLab). This worked horribly.

So we see that initial coarse registration doesn't really help as much as the overlap does.

case 3) Now I am thinking: before feeding the point-clouds to ICP, we can delete the non-overlapping regions from both point-clouds, so that the %-overlap is much higher. Note that doing ICP with these new point-clouds should still result in the same R,T output, so we would conservatively get the same answer. The advantage is two-fold: smaller input point-clouds mean faster ICP, and a higher %-overlap means a much more accurate answer.
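A minimal sketch of this pre-trimming idea (plain Python; the intersection of the two axis-aligned bounding boxes is my own crude stand-in for the manual deletion of non-overlapping regions, and overlap_crop is a hypothetical name):

```python
def overlap_crop(cloud_a, cloud_b, margin=0.0):
    """Keep only the points of each cloud that fall inside the intersection
    of the two clouds' axis-aligned bounding boxes, optionally expanded by
    a margin.  Points in the common region are untouched, so the optimal
    R,T between the cropped clouds should match that of the full clouds."""
    def aabb(cloud):
        xs, ys, zs = zip(*cloud)
        return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))
    (alo, ahi), (blo, bhi) = aabb(cloud_a), aabb(cloud_b)
    lo = [max(a, b) - margin for a, b in zip(alo, blo)]
    hi = [min(a, b) + margin for a, b in zip(ahi, bhi)]
    inside = lambda p: all(l <= v <= h for v, l, h in zip(p, lo, hi))
    return ([p for p in cloud_a if inside(p)],
            [p for p in cloud_b if inside(p)])
```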

As expected, this worked better: a much, much more accurate answer, plus much faster :)

Note that though the point-clouds had more %-overlap, they were NOT already coarsely registered as they were in case (2), which makes it a little harder. So if we also had an initial coarse pose estimate, then I believe we should get awesome results.
So let's do that:
Now we will not only keep just the overlapping parts of the point-clouds, but also keep them coarsely aligned. That will be our case (4).

case 4) Shockingly and sadly, it went horribly x-( .... I should try it again !!!!

Thursday, April 21, 2011

my progressive photon mapping implementation

Finally I have got progressive photon mapping (with surfaces, without participating media) working :) yay!
Details will come later, when I get some time to write about it properly, but the following output image may give you an idea of how much has been done so far...

Let me be honest, I loved the caustics :) If you look carefully, you may find some color bleeding happening too... which makes me even happier! The rendering uses a single light-source emitting 1 billion photons, and took around 4-5 minutes of execution time.

The next task in line is the inclusion of participating media. I have been reading [0], [1], and [2] to understand the concepts needed, although they won't tell me how to do it in my framework of progressive photon mapping. So I will need to understand how it works with participating media in regular photon mapping, and then design how it should be done in the progressive framework. Hope that's fun... ;)

Other important resources which will be helpful sooner or later:
1. A diploma thesis ("praca dyplomowa" in Polish) with a CUDA implementation of participating-media rendering with photon mapping
2. A really good wiki that explains the concepts of participating-media rendering in a very friendly, ground-up manner
3. A survey of participating-media rendering techniques (2005)

[0] "A Practical Guide to Global Illumination using Photon Maps", SIGGRAPH 2000 course notes
[1] "Efficient Simulation of Light Transport in Scenes with Participating Media using Photon Maps", Henrik Wann Jensen
[2] Henrik Wann Jensen's book, "Realistic Image Synthesis Using Photon Mapping" (Chapter 10)

Tuesday, April 12, 2011

Loading .OBJ 3D model files in OpenGL code

Tudor Carean provides an extended version of GLM and a tutorial on how to use it in your code. He also provides tutorials on creating a 3D model, exporting the model, getting GLM, importing the model in C++, and the details of the GLM structure.
I would personally prefer sticking to the original GLM instead of the one Tudor wrote, just because I believe it would be safer. You can download the original GLM from here.

What is GLM?[0]
GLM is a library that was provided in the examples folder of the GLUT source distribution. It provides API functions to read a Wavefront .OBJ file (and the corresponding .mtl materials file), and also to render these models in your OpenGL program. The required input files can be generated using 3DS Max or MeshLab (see here for how to do that). Basically, GLM knows how to read a Wavefront .OBJ file and how to draw it on the screen exactly the way it looked in 3D Studio Max. Actually, it can do a lot more[0].
GLM reads the model files into a data structure it calls _GLMmodel. To know more about this data structure and understand its internals, you can read Tudor Carean's tutorial on it.

Good Online Sources for Downloading Free 3D Models 

CUDA Kd-tree Implementations

My motivation for going after CUDA kd-tree implementations is to be able to do the k-Nearest-Neighbor (a.k.a. KNN) queries in the Delaunay triangulation (DT) procedure. The incremental algorithm for DT requires us to find the point with the nearest delaunay distance to a given edge. The delaunay distance of a point to an edge is not equal to the Euclidean distance between the two, but points that are closer to the edge Euclidean-distance-wise are more likely to be closer delaunay-distance-wise. Without any optimization or acceleration structure, finding the delaunay-closest point would require us to check every point in the input point-cloud to find the minimum. So, if we could somehow conservatively limit ourselves to a subset of all n points, that would speed up the algorithm significantly. Ergo, I thought of kd-trees. If we do a prepass where we take all n points and organize them in a kd-tree, finding such a subset of points is equivalent to doing KNN queries.

There is still one unanswered question in my mind: why isn't the Euclidean-nearest point to an edge the same as the delaunay-nearest point to the edge? Or is it? If yes, then why don't we just find the closest point (the nearest neighbor); why do all the circumcircle computations? There should be some reason why we do so, which makes the Euclidean-closest point not necessarily the delaunay-closest point to a given edge.
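That question can be settled with a tiny numeric experiment. The delaunay distance of a point p to an edge is (up to a sign) the radius of the circle through p and the edge's endpoints, and a point that is very close to the edge but nearly collinear with it has a huge circumradius. A plain-Python sketch (my own illustration):

```python
import math

def circumradius(a, b, p):
    """Radius of the circle through a, b, p, via R = (l1*l2*l3) / (4*area)."""
    l1, l2, l3 = math.dist(a, b), math.dist(b, p), math.dist(p, a)
    area = abs((b[0] - a[0]) * (p[1] - a[1])
             - (b[1] - a[1]) * (p[0] - a[0])) / 2.0
    return l1 * l2 * l3 / (4.0 * area)

edge_a, edge_b = (0.0, 0.0), (2.0, 0.0)
p_near = (1.0, 0.1)  # Euclidean-closest to the edge, but nearly collinear
p_far  = (1.0, 1.0)  # farther from the edge, yet a much smaller circumcircle
assert circumradius(edge_a, edge_b, p_near) > circumradius(edge_a, edge_b, p_far)
```

So the Euclidean-nearest point is not, in general, the delaunay-nearest one; but since nearby points are still the likely candidates, a KNN query gives a good candidate subset.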
Well, the following are, in my opinion, good resources for kd-trees on CUDA:


[1]  MNRT  (here)
The "Source Code Availability" section of the MNRT documentation says that "Source code of MNRT will be available soon. I still need to update some Doxygen comments to be ready for this step". So no source code for now.
so, MNRT rating: 1/5 - mainly because of the lack of source code ;)

[2] <<TODO>>
I could download and compile the code successfully. The code is meant for Windows and uses some Windows-specific calls for setting up timers, which can simply be commented out to make the code run on Linux (I use Ubuntu Maverick 10.10). The code takes an .obj model file, reads its vertices, and constructs a kd-tree with those points on the GPU. It writes the kd-tree blocks into an output file (named "kdtree.will").
I suspect that the code is an implementation of [0], simply because one of the authors of the code and one of the authors of the paper are both named Rui. But that's just a guess. If so, the paper can help in understanding the code.
The only problem I found with the code is that it does not yet provide nearest-neighbor queries on the kd-tree; it just constructs it. The next code I will talk about does provide such functions.
so, rating:  3/5 - good code but no nearest-neighbor queries yet.

[3] <<TODO>> Nghia Ho's blog post on KD-tree on GPU for 3D point searching discusses his implementation and results. His implementation also includes nearest-neighbor searching.
so, rating:  ??/5 - not tested yet.

[4] <<TODO>> Shawn Brown's implementation -
* is 2010 work.
* has source-code.
* supports 2D, 3D, and 4D point data.
* has GPU kernels for Query Nearest Neighbor (QNN), All Nearest Neighbor (All-NN), 'k' Nearest Neighbor (kNN), and All 'k' Nearest Neighbor (All-kNN).
* has documentation - there is a corresponding 2010 paper by the author.
so, rating:  ??/5 - for all the above reasons.

[5] Zidan's implementation:
* Not tested yet!

[6] Kd tree on GPU for 3D point searching
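For reference, here is what the core functionality of these libraries (build + kNN query) looks like as a minimal sequential sketch (plain Python, recursive; a real CUDA version, as in the codes above, would use an explicit array layout and iterative traversal):

```python
import heapq

def build_kdtree(pts, depth=0):
    """Node = (point, left, right, split-axis); axes cycle with depth."""
    if not pts:
        return None
    axis = depth % len(pts[0])
    pts = sorted(pts, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid],
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1),
            axis)

def knn(node, q, k, heap=None):
    """Return a max-heap (as (-dist2, point) pairs) of the k nearest points."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    p, left, right, axis = node
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    if len(heap) < k:
        heapq.heappush(heap, (-d2, p))
    elif d2 < -heap[0][0]:
        heapq.heapreplace(heap, (-d2, p))
    near, far = (left, right) if q[axis] <= p[axis] else (right, left)
    knn(near, q, k, heap)
    # descend the far side only if the splitting plane is closer than the
    # current k-th best distance
    if len(heap) < k or (q[axis] - p[axis]) ** 2 < -heap[0][0]:
        knn(far, q, k, heap)
    return heap
```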

Monday, April 11, 2011

Registering Two Point Clouds

Problem Statement
Given two 3D point clouds that overlap well enough with each other (how much is "well enough"?) in terms of scene content, we wish to combine them to build a composite point-cloud, which would naturally be denser than either of the two input point-clouds.

ICP Implementations
[2]    <<TODO>> 3D-ICP example here.
        - It provides a tiny example C++ file that uses the ICP API (CICP::Align3D()) provided by the MRPT library.
        - MRPT provides its source code via SVN, so we can look into the ICP implementation if we want to.
[4]     Don't know, but some C++ code; check it if the above ones aren't good enough.
From the above, I found the MRPT library implementation (the 2nd one in the list) the most reliable, so I am going for it.

Getting The MRPT ICP Example Running
Today I finally got a demo test application of ICP that uses the MRPT library running! Yippee :) Following are the steps I followed:
1. Download the source code for MRPT from its SVN. It's a fairly large codebase, ~150 MB.
2. Build the source (the regular "cmake ." and "make" way). This compiles the source and takes significant time. It creates the libraries in its local path.
3. From Synaptic, install the MRPT libraries (libmrpt-dev, mrpt-apps, mrpt-doc, mrpt-libs).
4. Get the test.cpp source code of this example.
5. Now you want to compile the test.cpp file. This step, as you might have guessed already, took some time. Here is what I did to get it done...
OK, to start with, I tried to build with a simple "g++ test.cpp" command, but it doesn't work; it fails to find the required header files, which are buried at various places in /usr/include/mrpt/... . I found the following way out: this web-page on the MRPT website specifically encourages using CMake to compile and link applications that use the MRPT library. The same page provides a sample CMakeLists.txt file, so I created a CMakeLists.txt file in the same folder as test.cpp.
Then, according to what the page suggests, I had to do "ccmake .", then "cmake .", and then "make". To get ccmake I had to install the cmake-curses-gui package from Synaptic.
So I did a "ccmake .". A GUI-kind-of thing opened up, where I pressed 'c' to configure (which had to be done twice, for some reason :-o), then pressed 'g' to generate and exit. Then I did "cmake ." and then "make", but make failed. I got the following error:

.../mrpttest> make
Scanning dependencies of target mrpt_example1
Linking CXX executable mrpt_example1
/usr/bin/ld: cannot find -lmrpt-base
collect2: ld returned 1 exit status
make[2]: *** [mrpt_example1] Error 1
make[1]: *** [CMakeFiles/mrpt_example1.dir/all] Error 2
make: *** [all] Error 2

So it is not able to locate the mrpt-base library. I tried to find the library file in the standard Linux paths where libs are typically stored, but failed to find it within 5 minutes. However, I had explicitly compiled the SVN source code of the MRPT library, so I had the compiled MRPT libraries with me. I thought that, for the time being at least, I would use them. In order to use them, I added a LINK_DIRECTORIES() line pointing at them to my CMakeLists.txt file.
Note that, for the reason mentioned here, the LINK_DIRECTORIES() clause has to be placed before the ADD_EXECUTABLE clause in the CMakeLists.txt file. Do not miss this part; it is crucial.
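A sketch of what the resulting CMakeLists.txt looks like (the LINK_DIRECTORIES path is a placeholder for wherever your SVN-built MRPT libraries actually live; the FIND_PACKAGE / MRPT_LIBS usage follows the sample file from the MRPT page):

```cmake
CMAKE_MINIMUM_REQUIRED(VERSION 2.4)
PROJECT(mrpt_example1)

FIND_PACKAGE(MRPT REQUIRED)

# Point the linker at the locally built MRPT libraries.
# NOTE: /path/to/mrpt-svn/lib is a placeholder, not my actual path.
# This MUST come before ADD_EXECUTABLE().
LINK_DIRECTORIES(/path/to/mrpt-svn/lib)

ADD_EXECUTABLE(mrpt_example1 test.cpp)
TARGET_LINK_LIBRARIES(mrpt_example1 ${MRPT_LIBS})
```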

After the above one-line change in the CMakeLists.txt file, I did ccmake, cmake, and make again. And bingo! It works now!
I got the same output as the screenshots displayed here.

Saturday, April 9, 2011

Delaunay Triangulation with CUDA - progress

I have been really irritated since I have not made any substantial progress so far. Finally, today when I resumed working on this project, I think I found my ray of hope! The ray is known as Dan Maljovec. He is a student at the University of Utah and worked on the exact same project in 2010. Here is his homepage. I am going to spend the next hour understanding his algorithm. C ya.

My Crappy Algo
Designed a not-very-bad algorithm:

Algorithm DT(points)

1) Find the bounding box of the points (Thrust has an API function for this).
2) Find the Delaunay triangulation of the BB + points together. The bounding box is now the convex hull of the whole system, so its edges will be in the DT.
Put the four edges in the AFL (Active Faces List).
      Call Kernel<<<n_blocks, n_threads>>>
      a) Each block undertakes one edge at a time and takes the responsibility of finding the simplex (a triangle in the 2D case) mounting on that edge.
      b) If a block has no edge to work on, it just waits till an edge is ready.
      c) Otherwise, all the hundreds of threads of the block check all points (hopefully just a subset of them, using some lossless optimization) and find the point with the nearest delaunay distance. Call it the Ndd point.
      d) The triangle (the edge, Ndd) is added to the DT. Here we don't want conflicting writes between different threads or different blocks.
      e) Once the edge has been processed as above, it is removed from the AFL, and the block takes up the next edge in its AFL.
We may do better in step (e). A block, when it finishes processing an edge, has created two new edges, and it is going to remove the processed edge from the AFL. What if it replaces that edge with one of the new edges, and then starts processing the other new edge immediately? I see that many blocks attempting to write their new edges back into a common work-list, while others try to read from it at the same time, is a very bad idea. But I liked the other part: once you, as a block, are done with one edge, start working on one of the new edges immediately, since this is equivalent to (but more efficient than) writing both new edges back to the work-list and then reading one edge from the same list. Well, let's keep this part of the idea on our mental to-do list. Next: what looks more apt here is of course a queue, not really a list.
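The "keep one new edge yourself, hand the other one back" idea can be sketched sequentially like this (mount_simplex is a hypothetical callback standing in for steps (a)-(d); it returns the new edges created by mounting a simplex on the given edge):

```python
from collections import deque

def process_front(initial_edges, mount_simplex):
    """Work-list sketch: a worker keeps one of the new edges it creates and
    continues with it immediately, enqueueing only the rest.  Returns the
    set of processed edges."""
    queue = deque(initial_edges)
    done = set()
    while queue:
        edge = queue.popleft()
        while edge is not None:
            if edge in done:
                break
            done.add(edge)
            new_edges = [e for e in mount_simplex(edge) if e not in done]
            # keep one new edge for ourselves, enqueue the rest
            edge = new_edges.pop() if new_edges else None
            queue.extend(new_edges)
    return done
```

In the CUDA version each block would run the inner loop privately, and only the shared-queue pushes and pops would need synchronization.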

1) multi-front or multi-rear queues
2) multiple queues: since every work-item is independent anyway

0) How do we guarantee that we don't create a triangle that was already there? In fact, we don't even want to create an edge that has already been created before.

1) Is every work-item (every edge in the work-queue) really independent? Say one block is processing an edge e1, and another block is processing some arbitrary edge e2. If e1 and e2 are really close, are the blocks going to collide? Worst case, are they going to create contradictory triangles?

Reasons why Thrust is great for me!!
  1. The "examples" include
  2. The "examples" include
  3. The API has a min_element() function to get the minimum in parallel.
  4. There is an API for creating random 2D points ;-)

Which One is Better To Parallelize- DeWall or InCoDe? :-(

I am confused which one to choose as the base algorithm.

DeWall implementation here
Parallel 2D Convex Hull with CUDA paper

Sequential Incremental Construction Algorithm
Parallel Incremental Construction Algorithm
  • ParInCoDe Paper: First one to parallelize it 
    • scalable: Divide and independently conquer, No merging.

Monday, April 4, 2011

Delaunay Triangulation using CUDA

I have planned to implement Delaunay triangulation in CUDA given a set of points.

Delaunay triangulation of a given set of points in d-dimensional Euclidean space is precisely defined in [2]. They define it as follows:

The Delaunay Triangulation DT(P) of the space E^d defined on a point-set P is the set of
d-simplices such that:
1. a point p in E^d is a vertex of a simplex in DT(P) iff p ∈ P;
2. the intersection of two simplices in DT(P) is either an empty set or a common face;
3. the hyper-sphere circumscribed around the d + 1 vertices of each simplex contains no other point
of the set P.

Here, by E^d we mean d-dimensional Euclidean space. It should be noted that the Delaunay triangulation of d-dimensional points gives d-simplices: that of 3D points gives tetrahedra, and that of 2D points gives triangles.
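For the 2D case, condition 3 is exactly the classic incircle predicate, which reduces to the sign of a 3x3 determinant (a plain-Python sketch of the test, my own illustration, not code from any of the papers):

```python
def incircle(a, b, c, d):
    """Returns > 0 iff d lies strictly inside the circle through the
    counter-clockwise triangle (a, b, c) -- i.e. condition 3 is violated."""
    rows = [(p[0] - d[0], p[1] - d[1]) for p in (a, b, c)]
    m = [(x, y, x * x + y * y) for x, y in rows]
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

# (0,0), (1,0), (0,1) is counter-clockwise; its circumcircle has
# center (0.5, 0.5) and radius sqrt(0.5)
assert incircle((0, 0), (1, 0), (0, 1), (0.25, 0.25)) > 0  # inside
assert incircle((0, 0), (1, 0), (0, 1), (2, 2)) < 0        # outside
```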

Initially I intended to do it for 3D point clouds, so that I could triangulate the point clouds I get from the Kinect. (With Kinect point clouds, we will most probably be additionally bugged by noise in the depth information.) However, that may be too big a task for too little time. So I have unhappily decided to do it just for a set of 2D points, so that I probably need to worry less about the algorithm and can give more time to learning CUDA, which is basically the formal intention of the project.

The duality between Delaunay triangulations and Voronoi diagrams is well known, and thus algorithms exist for constructing the former from the latter. However, it is generally more efficient to construct the triangulation directly: even though converting a Voronoi diagram into a Delaunay triangulation takes only O(n) time, you still have to build the Voronoi diagram first. For these reasons I am deliberately not looking into the algorithms that construct the triangulation from the Voronoi diagram.

So after spending some time surfing, I have shortlisted following papers:

1) A Quasi-Parallel GPU-Based Algorithm for Delaunay Edge-Flips (Aug. 2010)
     The algorithm is a flip algorithm[0]. The input to the algorithm is some (absolutely arbitrary??) triangulation of the points; the algorithm performs edge flips[1] and ends up turning it into the Delaunay triangulation. (The Delaunay triangulation of a given set of points is unique if the points are in so-called "general position".)
    The paper specifically talks about CUDA, which makes it a nice candidate on my list. The only thing I need to think about is: how do I get a valid (even if arbitrary) triangulation to start with? Triangulation by definition is 

2) An Improved Parallel Algorithm For Delaunay Triangulation on Distributed Memory Parallel Computers (1997)
- A divide-and-conquer, merge-sort-type algorithm. It divides the problem into smaller subproblems, and the final Delaunay triangulation is obtained by merging the sub-triangulations. The merging is typically the crux of the algorithm.

3) Parallel 3D Delaunay Triangulation (found here)
- A kind of mini-compendium of parallel Delaunay triangulation algorithms. Explains how to parallelize the two classes of algorithms doing the job: divide-and-conquer (the "Delaunay Wall" algorithm) and the incremental construction method.

4) Computing Two-dimensional Delaunay Triangulation Using Graphics Hardware (here)
- The paper utilizes the typical graphics pipeline for doing the job; it uses CUDA for some parts of the algorithm and the CPU for others.

"As you adequately put, the problem is choice." - Architect to Neo (from "Matrix Reloaded")

Which algorithm to choose, finally? Keeping CUDA in mind, I ideally want an algorithm that can be immediately (and preferably logically) broken down into essentially hundreds of sub-problems, such that the processing is as SIMD-like as possible in small pieces, and the merging of the sub-solutions is not heavy, or is at least parallelizable. I am thinking of going for (2).