Saturday, April 30, 2011

Parallel Convex Hull CUDA

1) gHull - http://www.comp.nus.edu.sg/~tants/gHull.html
- They have not released the source code yet.


- They specifically focus on CUDA.


3) The Graham Scan algorithm for convex hull also seems to be a popular target for parallelization
- Chris Harrison's page on the same algorithm is helpful; it also has sequential-version source code.

4) Mathematica's page on CUDA convex hulls

5) Optimal Multi-Core Convex Hull


6) Doing QuickHull on GPU - NVIDIA research summit poster here
- That's a research paper.
- Source Code: No idea


Among the above algorithms, I think QuickHull is the most amenable to parallelization.
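To see why QuickHull maps well to a GPU, here is a minimal sketch (plain C++, names of my own choosing, not taken from any of the projects above) of its core step: for a hull edge (a,b), every point's signed-distance test is independent of every other point's, so "find the farthest point outside this edge" is an embarrassingly parallel map followed by a reduction.

// Minimal QuickHull partition-step sketch (illustrative names only).
#include <vector>
#include <cstddef>

struct Pt { double x, y; };

// Signed area of triangle (a,b,p); > 0 means p lies to the left of a->b.
static double cross(const Pt& a, const Pt& b, const Pt& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// The per-edge work item: find the farthest point to the left of edge (a,b).
// Each loop iteration is independent, so on a GPU one thread could evaluate
// one point and a parallel reduction would pick the maximum.
static int farthestLeftPoint(const std::vector<Pt>& pts, const Pt& a, const Pt& b) {
    int best = -1;
    double bestDist = 0.0;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        double d = cross(a, b, pts[i]);
        if (d > bestDist) { bestDist = d; best = static_cast<int>(i); }
    }
    return best;  // -1 means no point lies outside this edge, so (a,b) is a hull edge.
}

On the GPU, the loop body would become the per-thread work and the maximum would come from a parallel reduction over the per-point distances.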

Saturday, April 23, 2011

Digging even more with ICP

The options that come with the CICP class are crucial (a hedged sketch of setting them follows the list below):
- icp.options.smallestThresholdDist (playing with this creates horribly bad clouds)
- icp.options.thresholdDist
- icp.options.thresholdAng

- icp.options.ICP_algorithm (only 'icpClassic' is implemented for ICP-3D)
- icp.options.maxIterations (its default may create a problem; better to explicitly keep it very high)
- icp.options.doRANSAC
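For reference, a hedged sketch of how these options might be set, assuming the MRPT 0.9-era mrpt::slam::CICP API. The header paths, the exact Align3D() signature, and all numeric values are my assumptions for illustration, not recommendations:

// Hedged sketch: setting the CICP options discussed above.
// Assumptions: MRPT 0.9-era headers/namespaces; the Align3D() signature is
// written from memory and should be checked against the MRPT docs; all
// numeric values are placeholders, not tuned recommendations.
#include <mrpt/slam/CICP.h>
#include <mrpt/slam/CSimplePointsMap.h>
#include <mrpt/poses/CPose3D.h>
#include <mrpt/poses/CPose3DPDF.h>

using namespace mrpt::slam;
using namespace mrpt::poses;

void alignClouds(CSimplePointsMap& reference, CSimplePointsMap& toAlign)
{
    CICP icp;
    icp.options.ICP_algorithm         = icpClassic; // only icpClassic works for ICP-3D
    icp.options.maxIterations         = 200;        // keep this explicitly high
    icp.options.thresholdDist         = 0.75f;      // meters (placeholder)
    icp.options.thresholdAng          = 0.15f;      // radians (placeholder)
    icp.options.smallestThresholdDist = 0.05f;      // very sensitive, tune carefully
    icp.options.doRANSAC              = false;

    CPose3D initialGuess(0, 0, 0, 0, 0, 0);
    float runTime = 0;
    // Align3D returns a pose *PDF*; the mean is the single R,T we actually use.
    CPose3DPDFPtr pdf = icp.Align3D(&reference, &toAlign, initialGuess, &runTime, NULL);
    CPose3D meanPose;
    pdf->getMean(meanPose);
}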



Why is ICP not converging?
How do we control the internal RANSAC within ICP (can we provide known correspondences to it?)
Why does ICP return a PDF from which we then take the mean? Isn't ICP supposed to compute just one R,T?

Friday, April 22, 2011

Digging more with ICP

This post is a continuation of my last post that talks about using ICP to register two coarsely registered point clouds. That post talked about getting the test.cpp code that comes with the MRPT library to perform ICP on two point-clouds. That code synthetically creates two small point-clouds, introduces a small random difference in their poses, and then calls CICP::Align3D() to register the point-clouds.

My plan is to use the ICP API for kinect-generated point-clouds, which are naturally much bigger and noisier than the synthetic point-clouds the original test.cpp generates. 

Until now I did the following tests:
case 1) A single kinect-generated point-cloud was fed as two point clouds, the difference being the same small random pose difference that the original example introduces. This worked great: the point-clouds were aligned nicely. Note that the same point-cloud was fed (with some small pose difference), so there was 100% overlap between the two point-clouds.

case 2) So now I fed two point-clouds with around 20% overlap, but already almost registered (manually in MeshLab). This worked horribly

So we see that the initial coarse registration doesn't really help as much as the overlap does.

case 3) Now I am thinking: before feeding the point-clouds to ICP, we can delete the non-overlapping regions from both point-clouds, so that the %-overlap becomes much higher. Note that doing ICP with these new point-clouds should still result in the same R,T output, so we would conservatively get the same answer. The advantage is two-fold: smaller input point-clouds mean faster ICP, and higher %-overlap means a much more accurate answer. (A minimal sketch of this cropping step follows below.)
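A minimal sketch of that cropping step, in plain C++; the overlap bounding box is assumed to be chosen by hand (e.g. eyeballed in MeshLab), so its corners here are placeholders:

// Keep only the points inside a (manually chosen, hence hypothetical)
// bounding box around the overlapping region, before running ICP.
#include <vector>
#include <cstddef>

struct Point3D { float x, y, z; };

std::vector<Point3D> cropToOverlap(const std::vector<Point3D>& cloud,
                                   const Point3D& boxMin, const Point3D& boxMax)
{
    std::vector<Point3D> kept;
    kept.reserve(cloud.size());
    for (std::size_t i = 0; i < cloud.size(); ++i) {
        const Point3D& p = cloud[i];
        if (p.x >= boxMin.x && p.x <= boxMax.x &&
            p.y >= boxMin.y && p.y <= boxMax.y &&
            p.z >= boxMin.z && p.z <= boxMax.z)
            kept.push_back(p);
    }
    return kept;  // feed the two cropped clouds to ICP instead of the full ones
}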

As expected, this worked better: a much more accurate answer, plus it was much faster :)

Note that although the point clouds had more %-overlap, they were NOT already coarsely registered as in case (2), which makes it a little harder. So if we had an initial coarse pose estimate, I believe we should get awesome results.
So let's do that:
Now we will not only keep just the overlapped parts of the point clouds, but also keep them coarsely aligned. That will be our case (4).

case 4) Shockingly and sadly, it went horribly x-( ... I should try it again!

Thursday, April 21, 2011

my progressive photon mapping implementation


Finally I have got progressive photon mapping (without participating media, with surfaces) working :) yay!
Details will come later, when I get some time to write about it properly, but the following output image may give you an idea of how much has been done so far...


Let me be honest, I loved the caustics :) If you look carefully, you may find some color bleeding happening too... which makes me even happier! The rendering used a single light-source emitting 1 billion photons, and took around 4-5 minutes of execution time.

The next task in line is the inclusion of participating media. I have been reading [0], [1], and [2] to understand the concepts needed, although they won't tell me how to do it in my framework of progressive photon mapping. So I will need to understand how it works with participating media in regular photon mapping, and then design how it should be done in the progressive framework. Hope that's fun... ;)

Other important resources which would be helpful sooner or later
1. A diploma thesis ("Praca dyplomowa") with a CUDA implementation for participating media rendering with photon mapping
2. A really good wiki that explains the concepts of participating media rendering in a very friendly, ground-up manner
3. A survey on participating media rendering techniques (2005)

References
[0] A Practical Guide to Global Illumination using Photon Maps, SIGGRAPH 2000 course notes
[1] Efficient Simulation of Light Transport in Scenes with Participating Media using Photon Maps, Henrik Wann Jensen
[2] Jensen's book "Realistic Image Synthesis Using Photon Mapping" (Chapter 10)

Tuesday, April 12, 2011

Loading .OBJ 3D model files in OpenGL code

Tudor Carean provides an extended version of GLM and tutorial on how to use it in your code. He also provides tutorials on Creating a 3D Model, Exporting the model, Getting GLM, Importing the model in C++ and Details about GLM structure.
I would personally prefer sticking to the original GLM, instead of the one Tudor wrote, just because I believe it would be safer. You can download the original GLM from here.


What is GLM?[0]
GLM is a library that was provided in the examples folder of the GLUT source distribution. It provides API functions to read Wavefront .OBJ files (and the corresponding .mtl materials files), and also for rendering these models in your OpenGL program. The required input files can be generated using 3DS Max or MeshLab (see here to learn how to do that). Basically, GLM knows how to read a Wavefront .OBJ file and how to draw it on the screen exactly the way it looked in 3D Studio Max. Actually it can do a lot more[0].
GLM reads the model files into a data structure it calls _GLMmodel. To learn more about this data structure and understand its internals, you can read Tudor Carean's tutorial on it.
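A small usage sketch, assuming the original (Nate Robins) GLM API from the GLUT examples; the model file name is a placeholder and error handling is omitted:

// Load a .obj with GLM and draw it from a GLUT display callback.
#include <GL/glut.h>
#include "glm.h"

static GLMmodel* model = NULL;

void loadModel(void)
{
    model = glmReadOBJ((char*)"bunny.obj");  // parses the .obj (+ .mtl) file; name is a placeholder
    glmUnitize(model);                       // scale/translate the model into a unit cube
    glmFacetNormals(model);                  // per-face normals
    glmVertexNormals(model, 90.0f);          // smooth vertex normals, 90-degree crease angle
}

void display(void)
{
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    glmDraw(model, GLM_SMOOTH | GLM_MATERIAL);  // render with smooth normals and materials
    glutSwapBuffers();
}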


Good Online Sources for Downloading Free 3D Models 

CUDA Kd-tree Implementations

My motivation for going after CUDA kd-tree implementations is to be able to do k-Nearest-Neighbors (a.k.a. KNN) queries in the Delaunay triangulation (DT) procedure. The incremental algorithm for DT requires us to find the point with the nearest Delaunay distance to a given edge. The Delaunay distance of a point to the edge is not equal to the Euclidean distance between the two, but points that are closer to the edge Euclidean-distance-wise are more likely to be closer Delaunay-distance-wise. Without any optimization or acceleration structure, finding the Delaunay-closest point would require us to check every point in the input point-cloud to find the minimum. So, if we could somehow conservatively limit ourselves to a subset of all n points, that would speed up the algorithm significantly. Ergo, I thought of a kd-tree: if we do a prepass where we take all n points and organize them in a kd-tree, finding such a subset of points is equivalent to doing KNN queries.

There is still one unanswered question in my mind: why isn't the Euclidean-nearest point to an edge the same as the Delaunay-nearest point to the edge? Or is it? If yes, then why don't we just find the closest point (the nearest neighbor), instead of doing all the circumcircle computations? There should be some reason why we do so, which makes the Euclidean-closest point not necessarily the Delaunay-closest point to a given edge. (A small sketch of the circumcircle-based quantity involved is below.)
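For concreteness, here is a sketch of the circumcircle-based quantity I believe is behind the "Delaunay distance" (my reading of the DeWall-style formulation, so treat it as an assumption): for an edge (a,b) and a candidate point p, what matters is the radius of the circle through a, b and p, not p's distance to the edge. Only the circumradius part is shown; the sign convention and any kd-tree pruning are omitted.

// Circumradius of the triangle (a, b, p): R = (la * lb * lc) / (4 * area).
#include <cmath>

struct P2 { double x, y; };

static double dist(const P2& u, const P2& v) {
    return std::sqrt((u.x - v.x) * (u.x - v.x) + (u.y - v.y) * (u.y - v.y));
}

double circumradius(const P2& a, const P2& b, const P2& p)
{
    const double la = dist(b, p), lb = dist(a, p), lc = dist(a, b);
    // Twice the signed area of triangle (a, b, p).
    const double area2 = (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
    const double denom = 2.0 * std::fabs(area2);          // equals 4 * |area|
    return denom > 0.0 ? (la * lb * lc) / denom : 1e300;  // collinear -> effectively infinite
}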
Well, following are, in my opinion, good resources for kd-tree on CUDA:

[0] http://www.nvidia.com/content/nvision2008/tech_presentations/Game_Developer_Track/NVISION08-Interactive_Ray_Tracing.pdf


[1]  MNRT  (here)
The "Source Code Availability" section of the MNRT documentation says: "Source code of MNRT will be available soon. I still need to update some Doxygen comments to be ready for this step". So no source code for now.
so, MNRT rating: 1/5 - mainly because of the lack of source code ;)

[2] <<TODO>> http://code.google.com/p/cuda-kdtree/source/checkout
I could download and compile the code successfully. The code is meant for Windows and uses some Windows-specific calls for setting up the timers, which can simply be commented out to make the code run on Linux (I use Ubuntu Maverick 10.10). The code takes an .obj model file, reads its vertices, and constructs a kd-tree with those points on the GPU. It writes the kd-tree blocks into an output file (named "kdtree.will").
I suspect that the code is an implementation of [0], simply because one of the authors of the code and one of the authors of the paper are both named Rui. But that's just a guess. If it is so, the paper can help in understanding the code.
The only problem I found with the code is that it does not yet provide nearest-neighbor queries on the kd-tree; it just constructs it. The next code I will talk about does provide such functions.
so, rating:  3/5 - good code but no nearest neighbor queries yet.

[3] <<TODO>> Nghia Ho's blog post on KD-tree on GPU for 3D point searching discusses his implementation and results. His implementation also includes the nearest neighbor searching.
so, rating:  ??/5 - haven't tried it yet.


[4] <<TODO>> Shawn Brown's implementation - http://www.cs.unc.edu/~shawndb/
* is 2010 work.
* has source-code.
* support for 2D, 3D, 4D point data.
* has GPU kernels for Query nearest neighbor (QNN), All-Nearest Neighbor (All-NN), 'k' nearest neighbor (kNN), and All 'k' Nearest Neighbor (All-kNN).
* has documentation- has corresponding 2010 paper by himself.
so, rating:  ??/5 - all above reasons.

[5] Zidan's implementation: http://www.mahmoudzidan.com/WorkArchive/
* Not tested yet!


[6] Kd tree on GPU for 3D point searching: http://nghiaho.com/?p=437




Monday, April 11, 2011

Registering Two Point Clouds


Problem Statement
Given two 3D point clouds with a good enough (how much??) overlap with each other (overlap in terms of scene information), we wish to combine them to build a composite point-cloud, which would naturally be denser than the two input point-clouds.

ICP Implementations
[2]    <<TODO>> 3D-ICP example here.
        - It provides a tiny example C++ file that uses the ICP API (CICP::Align3D()) provided by the MRPT library (libraries, svn, tutorials, documentation). 
        - MRPT provides its source code as an SVN, so we can look into the ICP implementation if we want to.
[4]     Don't know, but some C++ code; check it if the above ones aren't good enough.
        
From the above ones, I found the MRPT library implementation (the 2nd one in the above list) most reliable, so I am going for it. 

Getting The MRPT ICP Example Running
Today, I finally could run a demo test application of ICP that uses the MRPT library! Yippee :) Following are the steps I followed:
1. Download the source code for MRPT from its SVN. It's quite a large codebase, ~150 MB.
2. Build the source (the regular "cmake . && make" way). This will compile the source, which takes significant time, and will create the libraries in its local path.
3. From synaptic, install the mrpt libraries (libmrpt-dev, mrpt-apps, mrpt-doc, mrpt-libs).
4. Get the test.cpp source code of this example
5. Now you want to compile the test.cpp file. This step, as you might have guessed already, took some time. Following is what I did to get this step done...
To start with, I tried to build with a simple "g++ test.cpp" command, but it doesn't work: it fails to find the required header files, which are buried at various places in /usr/include/mrpt/... . I found the following way out: this web-page on the MRPT website particularly encourages using cmake for compiling and linking applications that use the MRPT library. On the same webpage, it provides a sample CMakeLists.txt file. So I created a CMakeLists.txt file in the same folder as test.cpp. 
Then, according to what the page suggests, I had to do "ccmake .", then "cmake ." and then "make". To get the ccmake tool I had to install cmake-curses-gui from Synaptic.
So, I did a "ccmake .". A GUI-kind-of thing opened up, where I pressed 'c' to configure (which had to be done twice, for some reason :-o) and then pressed 'g' to generate and exit. Then I did "cmake ." and then "make", but make failed with the following error:


.../mrpttest> make
Scanning dependencies of target mrpt_example1
Linking CXX executable mrpt_example1
/usr/bin/ld: cannot find -lmrpt-base
collect2: ld returned 1 exit status
make[2]: *** [mrpt_example1] Error 1
make[1]: *** [CMakeFiles/mrpt_example1.dir/all] Error 2
make: *** [all] Error 2

So it is not able to locate the mrpt-base library. I tried to find the library file in the standard Linux paths where libs are typically stored, but failed to find it within 5 minutes. However, I had explicitly compiled the SVN source code of the MRPT library, so I had the compiled MRPT libraries with me. I thought, for the time being at least, I would use them. In order to use them, I put the following line in my CMakeLists.txt file:
LINK_DIRECTORIES(.../mrpt-read-only/lib/)
Note that, for the reason mentioned here, the LINK_DIRECTORIES() clause has to be placed before the ADD_EXECUTABLE clause in the CMakeLists.txt file. Do not miss this part, this is crucial. 

After the above one-line change in the CMakeLists.txt file, I did ccmake, cmake, and make again. And bingo! It works now!
I got the following output which is the same as the screenshots displayed here.





Saturday, April 9, 2011

Delaunay Triangulation with CUDA - progress

I have been really irritated since I have not made any substantial progress till now. Finally, when I resumed working on this project today, I think I found my ray of hope! The ray is known as Dan Maljovec. He is a student at the University of Utah and worked on the exact same project in 2010. Here is his homepage. I am going to understand his algorithm in the next one hour. C ya.

My Crappy Algo
Designed a not-very-bad algorithm:

Algorithm DT(points)

1) Find the bounding box of the points (thrust has API function)
2) Find the Delaunay triangulation of the BB corners + points together. The bounding box is now the convex hull of the whole system, so its edges will be in the DT.
put the four BB edges in the AFL (Active Faces List).
      Call Kernel<<<n_blocks, n_threads>>>
      a) each block undertakes one edge at a time and takes the responsibility of finding the simplex (a triangle in the 2D case) mounting on that edge.
      b) if a block has no edge to work on, it just waits till an edge is ready.
      c) otherwise, all the hundreds of threads of the block check all points (hopefully just a subset of them, using some lossless optimization) and find the point with the nearest Delaunay distance. Call it the Ndd point.
     d) the triangle (the edge, Ndd) is added to the DT. Here we don't want conflicting writes between different threads or different blocks.
     e) once the edge has been processed as above, it is removed from the AFL, and the next edge in the current block's AFL is taken up by the block.
Here we may do better in step (e). A block, when it finishes processing an edge, has created two new edges, and it is going to remove the edge it processed from the AFL. What if it replaces that edge with one of the new edges, and then starts processing the other new edge immediately? I see that many blocks attempting to write one of their new edges back into a common work-list, while others try to read it at the same time, is a very bad idea, but I liked the other part: once you, as a block, are done with one edge, start working on one of the new edges immediately, since it is equivalent to (but more efficient than) writing both new edges back to the worklist and then reading one edge from the same list. Well, let's keep this part of the idea on our mental to-do list. The next thing is that what looks more apt than a list is, of course, a queue. (A hedged CUDA sketch of steps (a)-(c) is below.)
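Here is a hedged CUDA sketch of steps (a)-(c): one block per active edge, the block's threads scan the points in a strided loop, and a shared-memory reduction picks the point with the smallest Delaunay distance. delaunayDist() below is only a placeholder circumradius computation, the block size is assumed to be 256, and the conflict handling of steps (d)-(e) is not shown.

// Placeholder "Delaunay distance": the circumradius of (a,b,p). The real
// DeWall-style definition also involves a sign; that part is omitted here.
__device__ float delaunayDist(float2 a, float2 b, float2 p)
{
    float la = hypotf(b.x - p.x, b.y - p.y);
    float lb = hypotf(a.x - p.x, a.y - p.y);
    float lc = hypotf(a.x - b.x, a.y - b.y);
    float area2 = fabsf((b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x));
    return area2 > 0.0f ? (la * lb * lc) / (2.0f * area2) : 1e30f;
}

// One block per AFL edge; assumes blockDim.x == 256 (matches the shared arrays).
__global__ void processEdges(const float2* points, int nPoints,
                             const int2* afl, int nEdges, int* nddOut)
{
    __shared__ float bestDist[256];
    __shared__ int   bestIdx[256];

    int e = blockIdx.x;
    if (e >= nEdges) return;
    float2 a = points[afl[e].x];
    float2 b = points[afl[e].y];

    // Each thread scans a strided subset of the points.
    float myBest = 1e30f;
    int   myIdx  = -1;
    for (int i = threadIdx.x; i < nPoints; i += blockDim.x) {
        float d = delaunayDist(a, b, points[i]);
        if (d < myBest) { myBest = d; myIdx = i; }
    }
    bestDist[threadIdx.x] = myBest;
    bestIdx[threadIdx.x]  = myIdx;
    __syncthreads();

    // Tree reduction over the block to find the global minimum.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s && bestDist[threadIdx.x + s] < bestDist[threadIdx.x]) {
            bestDist[threadIdx.x] = bestDist[threadIdx.x + s];
            bestIdx[threadIdx.x]  = bestIdx[threadIdx.x + s];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) nddOut[e] = bestIdx[0];  // the Ndd point for this edge
}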


Possibilities-
1) multi-front or multi-rear queues
2) multiple queues: since as it is every work-item is independent

worries
0) How do we guarantee that we don't create a triangle that was already there? In fact, we don't even want to create an edge which has already been created before.

1) Is every work-item (every edge in the work-queue) really independent? Say one block is processing an edge e1, and another block is processing some arbitrary edge e2; if e1 and e2 are really close, are the blocks going to collide? Worst case, are they going to create contradictory triangles?


Reasons why Thrust is great for me!! (a tiny sketch using some of these pieces follows the list)
  1. The "examples" include bucket_sort2d.cu
  2. The "examples" include discrete_voronoi.cu
  3. The API has min_element() function to get the minimum in parallel.
  4. creating random 2D points API ;-)
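A tiny sketch of those pieces: generate random 2D points on the host, copy them to the device, and compute the bounding box with thrust::minmax_element (min_element alone gives one end of it). The point count and seed are arbitrary.

// Bounding box of a random 2D point set, computed on the GPU with Thrust.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/random.h>
#include <thrust/pair.h>
#include <cstdio>

int main()
{
    const int n = 1 << 20;
    thrust::host_vector<float> hx(n), hy(n);
    thrust::default_random_engine rng(1234);
    thrust::uniform_real_distribution<float> u01(0.0f, 1.0f);
    for (int i = 0; i < n; ++i) { hx[i] = u01(rng); hy[i] = u01(rng); }

    thrust::device_vector<float> dx = hx, dy = hy;   // host -> device copies

    // Bounding box of the point set.
    typedef thrust::device_vector<float>::iterator It;
    thrust::pair<It, It> xr = thrust::minmax_element(dx.begin(), dx.end());
    thrust::pair<It, It> yr = thrust::minmax_element(dy.begin(), dy.end());

    printf("bbox: [%f, %f] x [%f, %f]\n",
           (float)*xr.first, (float)*xr.second, (float)*yr.first, (float)*yr.second);
    return 0;
}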






Which One is Better To Parallelize- DeWall or InCoDe? :-(

I am confused which one to choose as the base algorithm.

DeWall implementation here
Parallel 2D Convex Hull with CUDA paper

Sequential Incremental Construction Algorithm
Parallel Incremental Construction Algorithm
  • ParInCoDe Paper: First one to parallelize it 
    • scalable: Divide and independently conquer, No merging.

Monday, April 4, 2011

Delaunay Triangulation using CUDA

I have planned to implement Delaunay triangulation in CUDA given a set of points.

Delaunay triangulation of a given set of points in d-dimensional Euclidean space is accurately defined in [2]. They define it as following:

The Delaunay Triangulation DT(P) of the E^d space defined on a point set P is the set of d-simplices such that:
1. a point p in E^d is a vertex of a simplex in DT(P) iff p ∈ P;
2. the intersection of two simplices in DT(P) is either an empty set or a common face;
3. the hyper-sphere circumscribed around the d + 1 vertices of each simplex contains no other point of the set P.

Here, by E^d we mean d-dimensional Euclidean space. It should be noted that the Delaunay triangulation of d-dimensional points gives d-simplices, meaning that that of 3D points gives tetrahedra, and that of 2D points gives triangles.


Initially I intended to do it for 3D point clouds, so that I could triangulate the point cloud that I get from the Kinect. (With Kinect's point clouds, we will most probably additionally be bugged by the noise in the depth information.) However, that may be too much of a task for too little time. So, I have unhappily decided to do it just for a set of 2D points, so that I would probably need to worry less about the algorithm and could give more time to learning CUDA, which is basically the formal intention of the project.


The duality between Delaunay triangulations and Voronoi diagrams is well known, and thus algorithms exist for constructing the former from the latter. However, it is generally more efficient to construct the triangulation directly: even though the conversion from a Voronoi diagram to a Delaunay triangulation is only O(n), you still have to build the Voronoi diagram first, so there is no real saving. For these reasons I am deliberately not looking into algorithms that construct the triangulation from the Voronoi diagram.


So after spending some time surfing, I have shortlisted following papers:

1) A Quasi-Parallel GPU-Based Algorithm for Delaunay Edge-Flips (Aug. 2010)
     The algorithm is a flip algorithm[0]. The input to the algorithm is some triangulation (absolutely arbitrary??) of the points, and the algorithm performs edge flips[1] until it becomes the Delaunay triangulation. (The Delaunay triangulation for a given set of points is unique if the points are in so-called "general position".)
    The paper specifically talks about CUDA, which makes it a nice candidate on my list. The only thing I need to think about is: how do I get a valid (even if arbitrary) triangulation to start with? A triangulation, by definition, is just a subdivision of the points' convex hull into triangles whose vertices are exactly the input points. (A sketch of the in-circle test that drives such flips appears after this list.)

2) An Improved Parallel Algorithm For Delaunay Triangulation on Distributed Memory Parallel Computers (1997)
- A divide-and-conquer merge-sort-type algorithm. It divides the problem into smaller subproblems and then the final delaunay triangulation is obtained by merging the sub-triangulations. Merging typically is the crux of the algorithm.

3) Parallel 3D Delaunay Triangulation (found here)
- A kind of mini-compendium for parallel Delaunay triangulation algorithms. Explains how to parallelize two classes of the algorithms doing the job, divide-and-conquer ("Delaunay Wall" algorithm) and incremental construction method.


4) Computing Two-dimensional Delaunay Triangulation Using Graphics Hardware (here)
- The paper utilizes the typical graphics pipeline for doing the job, also uses CUDA for some part of the algorithm and uses CPU for some part.
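As promised under paper (1): the standard empty-circumcircle ("in-circle") predicate that decides whether an edge needs flipping. This is the generic textbook determinant test, written without any numerical-robustness tricks; it is not taken from that paper's code.

// d lies inside the circumcircle of triangle (a,b,c) iff this determinant is
// positive, assuming (a,b,c) is given in counter-clockwise order.
struct Pt2 { double x, y; };

bool inCircumcircle(const Pt2& a, const Pt2& b, const Pt2& c, const Pt2& d)
{
    const double ax = a.x - d.x, ay = a.y - d.y;
    const double bx = b.x - d.x, by = b.y - d.y;
    const double cx = c.x - d.x, cy = c.y - d.y;
    const double det =
        (ax * ax + ay * ay) * (bx * cy - cx * by) -
        (bx * bx + by * by) * (ax * cy - cx * ay) +
        (cx * cx + cy * cy) * (ax * by - bx * ay);
    return det > 0.0;  // true -> the edge opposite d is not locally Delaunay, so flip it
}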


Discussion
"As you adequately put, the problem is choice." - Architect to Neo (from "Matrix Reloaded")

Which algorithm to choose, finally? Keeping CUDA in mind, I ideally want an algorithm that can be immediately (preferably logically) broken down into hundreds of sub-problems, such that the processing is as SIMD-like as possible in small pieces, and the merging of the sub-solutions is not heavy, or is at least parallelizable. I am thinking of going for (2).







Thursday, March 31, 2011

Starting Off With CUDA

First Few Steps...
About to kick off with CUDA, and I admit I'm pretty excited! Yesterday, I managed to compile and execute my first CUDA program (I'm using Linux), which was basically taken from this blog.
(A)  CUDA tutorial blog I (here)
This blog is good for beginners. From the same blog following posts are helpful:
My first CUDA program
Threads and block and grids, oh my!

(B)  CUDA tutorial blog II (here)
Another good blog. Following posts are found helpful:
Inter-thread communication: explains a code snippet for reduction
Atomic operations: demonstrates why using atomicAdd() (or any atomicXXX()) can be a bad idea


(C)  CUDA Online Course (svn downloadable) by Stanford folks
here


CUDA Helper Libraries
(see the CUDA tools and ecosystem page by NVIDIA)
(A) Thrust Standard Template Library
Thrust is a C++ template library for CUDA. Thrust allows you to program GPUs using an interface similar to the C++ Standard Template Library (STL). These templates are a way to write generic algorithms and data structures. The template library is simply a cohesive collection of such algorithms and data structures in a single package, acting as a wrapper around the CUDA API. (A minimal example follows below.)
Downloads
Documentation
FAQ
Note that Thrust requires CUDA 3.0 or newer.
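A minimal example of what "STL-like" means in practice (the sizes are arbitrary):

// Fill a host_vector, copy it to the device by assignment, sort on the GPU.
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

int main()
{
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

    thrust::device_vector<int> d = h;             // host -> device copy
    thrust::sort(d.begin(), d.end());             // sort on the GPU
    thrust::copy(d.begin(), d.end(), h.begin());  // device -> host
    return 0;
}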

(B) CUDA Utility (CUTIL) 
I couldn't find any documentation for the library anywhere. However, the header file itself (NVIDIA_CUDA_SDK/common/inc/cutil.h) has doxygen-style comments and will tell you exactly what the functions do.
CUTIL provides functions for (a hedged usage sketch of the timer functions follows this list):
  • Parsing command line arguments
  • Read and writing binary files and PPM format images
  • Comparing arrays of data (typically used for comparing GPU results with CPU)
  • Timers
  • Macros for checking error codes
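A hedged usage sketch of the timer functions, based on the cutil.h that shipped with the early CUDA SDKs; the exact function and macro names vary between SDK versions, so treat this as an assumption to check against your own header:

// Timing a piece of GPU work with the CUTIL timer helpers (assumed API).
#include <cutil.h>
#include <cuda_runtime.h>

void timeSomething(void)
{
    unsigned int timer = 0;
    cutCreateTimer(&timer);
    cutStartTimer(timer);

    // ... launch kernels / do work here ...
    cudaThreadSynchronize();   // make sure the kernels have finished before stopping

    cutStopTimer(timer);
    float ms = cutGetTimerValue(timer);   // elapsed time in milliseconds
    cutDeleteTimer(timer);
    (void)ms;
}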




Debugging CUDA
Check out the debugging and profiling page on NVIDIA's website.
I am thinking of using TotalView, mainly because it has a GUI and lets me debug both host and device code simultaneously.


Miscellaneous Issues
CUDA FAQ by NVIDIA
This page describes how to dynamically allocate shared memory.
CUDA Data structures (Kdtree, Octree, etc.) implementations
CUDA Kd-tree source-code (which did not extract x-()

Wednesday, February 16, 2011

Underwater Scenes Illumination : Important Resources

Photon Mapping with participating media
Jensen's 1998 paper talks about it for the first time.
The practical guide to global illumination using photon maps is brilliant for understanding it.
I found a code for photon mapping with participating media here.






Miscellaneous
  • This page describes a course project on realistic underwater lighting using photon mapping by Stanford students. Has a source code and sample scene too.
  • This page by Mark Kilgard, talks about OpenGL-rendering of underwater caustics.

[EDIT: March 16, 2011]

[1] The FAQs on photon mapping implementation here and here are excellent.
[2] This source-code for photon mapping is a complete package (including the ray tracing pass).
[3] This source code is Jensen's photon mapping code (with some little tweaks).
[4] This tutorial on photon mapping by Zack Waters is also excellent and is on the must-refer-to list.

[5]http://www.mpi-inf.mpg.de/departments/d4/teaching/ws200708/cg/rc/ko/



Downloadable and compilable ray tracing implementations available on the Internet
  • Simple Ray tracer (uses Windows-specific headers; doesn't use acceleration structure; fast)
  • RayTrace 3.3 (both platforms; organized code; uses kd-tree; supports .obj and .nff scenes (arbitrary scenes don't work))
  • raytracer7 (really fast, uses kd-tree)
  • ultimate ray tracer (haven't tried)
  • heatRay (brilliant screenshots; haven't run the code yet)

3D Models Downloads available over the Web


[This list will be updated repeatedly.]

Wednesday, February 2, 2011

Photon Mapping

Important Resources I am using to right now

1) Jensen's Book ("Realistic Image Synthesis Using Photon Mapping")
- He, basically, invented photon mapping.
- Noticeably, the foreword of the book (by Pat Hanrahan) cleverly clarifies the big picture about radiosity, Monte Carlo ray tracing, and photon mapping, and how they are related.
- The book also includes a photon mapping implementation.
- The first edition of the book had some errors, corrections to which are described here.


2) The Basic Monte Carlo Integration Technique
- The appendix A of Jensen's book has an excellent description of basic Monte Carlo integration.
- I couldn't find a better description of variance and standard deviation than this one.

3) Ray tracing and photon mapping implementation here. It is based on this paper. The source code is a little weird ;-) I couldn't make it work in the first 10 minutes, so I left it.


What am I planning to do with Photon mapping?

I shortlisted GPU photon mapping and volume rendering (participating media) as things that I would like to do with photon mapping. It turned out that it's going to be somewhat volume rendering; precisely, underwater illumination. I will be maintaining a separate post dedicated to the important resources that talk about underwater illumination (with photon mapping or even with other techniques, for reference).



A few quick must-know-answer types questions:

1) Photon mapping's light transport notation?

Photon Map Creation Phase : L(S|D)*D
Photon Gathering/ Ray Tracing Phase : LS*E | DS*E


2) Is real-time photon mapping possible today? What's the fastest FPS for fairly complex scenes? (Remember: one of the motivations for saying NO to radiosity and YES to photon mapping is complex scenes; and one of the motivations for saying NO to Monte Carlo and YES to photon mapping is performance.) Get the fastest FPS numbers for both CPU and GPU implementations.


4) Along the same lines, photon mapping on the GPU: what's the state of the art?


5) It seems some people prefer photon mapping mainly because of caustics. If you try to think about why radiosity won't give you caustics, you will see that caustics involve specular objects, which radiosity is useless for. So that part is fairly straightforward for radiosity.
But why would Monte Carlo ray tracing not give you good caustics?
==>
A PhD dissertation helped me answer this question to a good degree. It says: "A major drawback of backward raytracing is the inability to adequately handle caustics. These are highly directional global illumination components which occur through specular reflections onto diffuse surfaces. They are often noticeable as iridescent highlights on otherwise dull materials from nearby specular objects. Since a backward raytracer traces rays from the observer toward the light sources it cannot predict the occurrence of caustics, as this would imply knowledge of which paths contribute to the caustic and ultimately reach the light source. Without such knowledge, rendering caustics with a backward raytracer borders on an exercise in futility, as the chances of finding paths which contribute to a caustic are miniscule. (sic)"
The dissertation also provides a few visuals to clarify the point. I found those pictures necessary to understand the point completely. The following visual from it may help.


Tuesday, February 1, 2011

Progress II

Finally, after spending a few more hours on deciding which parallel radix sort algorithm to implement, I have come to the conclusion that I am going to implement Partitioned Parallel Radix Sort. I have gone through the algorithm and started with a basic implementation.
One issue that might have to be faced later is that the paper targets distributed-memory architectures, whereas OpenMP targets shared-memory architectures.

Interesting Readings:
Histogram Sort: to find the size of the buckets beforehand, so that we won't waste space by overestimating the bucket size.
This might give you a better idea of what these sorting people mean by a Histogram !

Monday, January 31, 2011

Progress

Well, I found some ( read : very very few ) implementations of bucket sort. Two of them were OpenMP based.
1) http://link.ece.uci.edu/~weiran/documents/eecs221_w08.pdf

2) http://www.moon.sannet.ne.jp/okahisa/sort/node32.html


The Third One:

It uses a library called MCSTL. It can be downloaded from here. Its corresponding research paper can be downloaded from here. Interestingly, my version of Ubuntu, viz. 10.10 or Maverick, does not have support for MCSTL. Even after I downloaded MCSTL and copy-pasted the .h files into my "include" folder (/usr/include), it did not work somehow. Apparently, there is no way to install MCSTL, so I had to do this ugly copy-paste. Bottom-line, it did not work. Later I found that MCSTL requires g++ 4.2.2, which Maverick doesn't support, so the conclusion is: forget MCSTL.

Fortunately, I found an interesting piece of information on the STXXL library page. It says STXXL may internally use MCSTL or the libstdc++ parallel mode, the latter being a successor of MCSTL.
So, that's what I need: the libstdc++ parallel mode library. This page is a decent piece of documentation about it. Being part of libstdc++, it has been very well documented in many places.

EDIT: Got some documentation for mcstl.







Other interesting Pages:


Wednesday, January 26, 2011

Frameworks: Monte Carlo RT & Photon Mapping

Monte-Carlo RT is an extension of classical ray tracing (Whitted RT), and in a few minutes you will know what I mean by that.
Classical ray tracing traces one ray per pixel; at every intersection of a ray with the scene, single rays are spawned in the three most important (most significant) and exact directions: towards the lights, the mirror-reflection direction, and the perfect-refraction direction. This does not handle glossy or diffuse inter-reflections.
Monte-Carlo ray tracing has two main types.
Path tracing takes the approach of having a single ray reflect or refract through the scene, changing direction at each surface intersection. This may sound exactly like classical ray tracing, but there is an important difference: Monte-Carlo ray tracing picks a random direction for each reflection or refraction ray. The direction is obviously not completely random, but is influenced by the BRDF. The BRDF, in some sense, acts like a probability density function. And the BRDFs can be arbitrary. For example, for a glossy surface with a narrow Phong reflection lobe, reflection rays would be much more likely to be spawned near the center of the lobe than anywhere else. Weighting the random direction in this way is called importance sampling.
Distribution ray tracing spawns many random rays from every surface intersection.

Why distribution ray tracing would give us more photo-realism?
Let's come back to the analogy of the BRDF as a probability distribution function. This analogy is natural, since the definition of the BRDF is essentially how much (in terms of a ratio) of the input radiance will be reflected in every specific possible direction. This ratio naturally defines the probability of any given ray getting reflected in that specific direction. So, the point is: BRDF = probability distribution function. So, say given 100 input (i.e. incoming) rays, 40 will go in a preferred mirror-reflection direction and the remaining 60 will be reflected into the rest of the infinite hemispherical directions. Note that, if we assume that the underlying surface roughly does follow the estimated BRDF, then these rays going in all these directions in all these proportions is physically what happens.
Now, if we just ignored 99 of these rays and followed a single ray in a single specific direction (which would be random in path tracing, and fixed in classical RT), then we are clearly losing out on a lot of rays which physically contribute to the resulting illumination at that point.
Distribution ray tracing gives the required respect and consideration to these 99 rays. It turns out that these extra rays help us get closer to reality, closer to the real picture.
I like that! So why not just use it?
Because, in reality, 99 becomes infinite. And that will make our program finish its computations after lightyears. You might be, but I, personally, am not OK with this. So, how do we do distribution ray tracing today?
As you might have guessed, we don't sample all the infinite rays. Instead, we choose some number of random directions. Since we choose, we would like them to be the important ones.
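To make "choosing the important directions" concrete, here is the textbook cosine-weighted hemisphere sampling used for a diffuse BRDF: directions near the normal, which contribute the most, are generated more often. This is a generic formula, not code from any renderer mentioned in this post.

// Cosine-weighted hemisphere sampling in the local frame where the normal is (0,0,1).
#include <cmath>
#include <cstdlib>

struct Vec3 { double x, y, z; };

static double urand() { return rand() / (RAND_MAX + 1.0); }  // uniform in [0,1)

Vec3 sampleCosineHemisphere()
{
    const double PI = 3.14159265358979323846;
    const double u1 = urand(), u2 = urand();
    const double r = std::sqrt(u1);          // radius in the unit disk
    const double phi = 2.0 * PI * u2;        // angle around the normal
    Vec3 d;
    d.x = r * std::cos(phi);
    d.y = r * std::sin(phi);
    d.z = std::sqrt(1.0 - u1);               // cos(theta), so pdf = cos(theta)/pi
    return d;
}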

As a final note, I should state this: Monte Carlo ray tracing fully solves Kajiya's rendering equation, given enough time and rays.



Reference: Real-time Rendering, section 9.8.2 ( ray tracing )

Further Reading:

My first parallel program : Bucketsort

Well, let's get to the point. How will we parallelize bucket sort?
What's The Scope? : Figuring out *how much* it can be parallelized
Radix sort, as we know, is basically doing bucket sort on every digit, from right to left, least significant digit first. Broadly, can we parallelize the processing of the individual digits? That is, given 4-digit integers, can we have 4 threads that independently sort the four digits? No, since the whole idea of radix sort depends on sorting the digits sequentially. That is, we must sort all numbers on the least significant digit first, then sort the resulting array on the next-least-significant digit, get the resulting array, then sort on the next-least-significant digit, and so on.
So, this part has to be serial.

However, in the inner bucket sort, the numbers can be inserted into their respective buckets completely independently. Great! But there is just one problem. Say there are a million numbers and we use the classic approach of ten buckets. Then there are going to be numerous collisions on any given bucket. So can we somehow get around this?

Idea I
If we have only 10 buckets and there are thousands of threads, then there are bound to be plenty of collisions. One way to avoid these collisions is to somehow increase the number of buckets. How can we increase the number of buckets?
We have 10 buckets because we examine one digit of each number. How many values can a single digit take? Ten. Hence we made 10 buckets. Now, say we examine two digits at a time. Two digits can take values ranging from 00 to 99, i.e. a hundred values. That means a hundred buckets. So, we'll make 100 buckets, and instead of examining one digit at a time, we examine two. The rest remains the same, e.g. considering the rightmost (least significant) pair of digits at the first level, going towards the leftmost end level by level, etc.
So, do you get the essence of what we did? We increased the number of buckets, by examining more digits at a time, to reduce the number of collisions on any bucket. If we take this approach further, we are basically going towards pure counting sort. Remember counting sort? In counting sort, say if no number is greater than 6700, then we create 6700 buckets, and then every number i goes into its own bucket, meaning the i'th bucket. It will have zero collisions when there are no/few repetitions. Zero collisions! What you pay to get this is lots of memory.
So, from a bird's eye view, we see two extremes.
One is pure bucket sort, where we examine 1 digit at a time and have 10 buckets, implying a minimum memory footprint but high collision.
Two is pure counting sort, where we examine all d digits at a time and have N_MAX buckets, meaning a huge memory footprint but zero collisions.
Clearly, it makes us think that the best way is the middle way. Can we come up with an adaptive heuristic that tells me "dude, if you examine p digits at a time, you'll get the best possible balance of memory footprint & collisions"? (A small OpenMP sketch of one such pass is below.)
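To make the trade-off concrete, here is a small OpenMP sketch of one pass of the "examine p digits at a time" idea, using a radix of 2^p instead of powers of ten (same idea): each thread counts, in a private histogram, how many keys fall into each of the 2^p buckets, which tells us exactly how much space each bucket needs before scattering. The values of p and n are arbitrary.

// One histogram pass of a parallel radix/bucket sort (OpenMP, host code).
#include <omp.h>
#include <vector>
#include <cstdlib>

int main()
{
    const int p = 8;                       // digits (here: bits) examined per pass
    const int nBuckets = 1 << p;           // 2^p = 256 buckets
    const int shift = 0;                   // least-significant digit first
    std::vector<unsigned int> keys(1 << 20);
    for (std::size_t i = 0; i < keys.size(); ++i) keys[i] = rand();

    std::vector<long> hist(nBuckets, 0);
    #pragma omp parallel
    {
        std::vector<long> local(nBuckets, 0);      // private histogram: no collisions
        #pragma omp for nowait
        for (long i = 0; i < (long)keys.size(); ++i)
            ++local[(keys[i] >> shift) & (nBuckets - 1)];
        #pragma omp critical                        // merge once per thread
        for (int b = 0; b < nBuckets; ++b) hist[b] += local[b];
    }
    // A prefix sum over hist now gives each bucket's starting offset for the
    // scatter step; repeat with shift += p for the next digit.
    return 0;
}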


Some one has already spent some time on this!
So obviously I googled. I found this pdf, which basically is a project report by Nie Weiran from 2008 dedicated to parallelizing radix sort. What Else Would I Want! :)
After I go through the paper, which is not more than 3-4 pages, I will post the essence and, of course, my thoughts.



A word on performance measurement
I spent almost 2 hours searching and reading about the performance measurement tools for Linux multi-core environment. There were seemingly-good tools out there, important ones of which have been surveyed here.
For now, I didn't need a very hi-fi tool; I just needed to see the real-time utilization of my CPU cores (the machine I am using is an Intel i5 quad-core). For this primary purpose, I found "gnome-system-monitor" really good. My system already had it installed; maybe it comes by default.


Progress  - I


Well, I found some (very very few :)) implementations of bucket sort. Two of them were OpenMP based.
1) http://link.ece.uci.edu/~weiran/documents/eecs221_w08.pdf
2) http://www.moon.sannet.ne.jp/okahisa/sort/node32.html
3) http://snippets.dzone.com/posts/show/4365


A Note about the third one
It uses a library called MCSTL. It can be downloaded from here. Its corresponding research paper can be downloaded from here. Interestingly, my version of Ubuntu, viz. 10.10 or Maverick, does not have support for MCSTL. Even after I downloaded MCSTL and copy-pasted the .h files into my "include" folder (/usr/include), it did not work somehow. Apparently, there is no way to install MCSTL, so I had to do this ugly copy-paste. Bottom-line, it did not work. Later I found that MCSTL requires g++ 4.2.2, which Maverick doesn't support, so the conclusion is: forget MCSTL.
Fortunately, I found an interesting piece of information on the STXXL library page. It says STXXL may internally use MCSTL or the libstdc++ parallel mode, the latter being a successor of MCSTL. So, that's what I need: the libstdc++ parallel mode library. This page is a decent piece of documentation about it. Being part of libstdc++, it has been very well documented in many places.
I got some documentation of MCSTL here.

Progress II


Finally, after spending a few more hours on deciding which parallel radix sort algorithm to implement, I have come to the conclusion that I am going to implement Partitioned Parallel Radix Sort. I have gone through the algorithm and started with a basic implementation. One issue that might have to be faced later is that the paper targets distributed-memory architectures, whereas OpenMP targets shared-memory architectures.


Miscellaneous interesting Pages


1. Parallel In-Place Radix Sort Simplified
2. 32 OpenMP Traps For C++ Developers
3. Histogram Sort (to find the size of the buckets beforehand, so that we won't waste space by overestimating the bucket size)
4. A better idea of histogram sorting



Sunday, January 16, 2011

Point Rendering

The paper here gives a good background of how point rendering has evolved, in its introduction section.

What is challenging with point-rendering?
Filling the holes. Broadly, there are two ways you fill up the holes. One is to generate a polygonal mesh, by reconstructing the surface of the point-based model, and then render the mesh. In this method, the polygons (usually triangles) fill up the gaps, and the color/depth values are interpolated within every triangle, so effectively the gaps are filled. The second way to fill up the holes is to not generate a triangle mesh and directly render points as points. Here, finding and filling up the holes is relatively more work.


Triangle Mesh Generation Vs Direct Point Rendering
A few years ago, the 3D-scanner devices were not as fine as they are today. Hence, relatively fewer point samples were available. That made surface reconstruction (building a triangle mesh over the points' surface and eventually rendering the triangles) much more suitable than direct point rendering, as there were bigger and larger gaps to fill.
However, today, the number of samples that are captured is much larger. The gaps are smaller, and the problem with the surface reconstruction algorithms currently out there is that they grow super-linearly with the number of points. (I didn't quite understand how this makes them bad? :O)
So, though surface reconstruction methods were better some time back, direct point rendering now has the upper edge.



Future Readings:

Symposium on Point-Based Graphics is a conference that is dedicated to point-based graphics! Interesting!!

Saturday, January 15, 2011

Point Rendering

The paper here gives a good background of how point rendering has evolved, in its introduction section.

What is challenging with point-rendering?
Filling the holes. Previously, the 3D-scanner devices were not as fine as they are today. Hence, relatively fewer point samples were available. That made surface reconstruction (building a triangle mesh over the points' surface and eventually rendering the triangles) much more suitable than direct point rendering. However, today, the number of samples that are captured is much larger.

Symposium on Point-Based Graphics is a conference that is dedicated to point-based graphics! Interesting!!


Thursday, January 13, 2011

Sunday, January 9, 2011

Microsoft Kinect

following resources are really important:
Overall Statistical Information
  1. http://en.wikipedia.org/wiki/Kinect has very good statistical data about the device and a solid bunch of references.
Making novel use of Kinect
Point cloud rendering
  1. http://www.blendernation.com/2010/11/26/microsoft-kinect-in-blender-realtime-point-cloud-demonstration/ It talks about using the Kinect device to capture a point cloud and rendering it using Blender at 2-15 FPS depending on the point-cloud quality. The video was not very impressive, basically because it's a very badly composed video; you can't make out anything from it!!!
  2. http://www.digitalsociety.org/2010/11/kinect-modified-to-capture-true-3d-video/ showcases an impressive video!!!!! A researcher at University of California Davis, Oliver Kreylos has produced one of the most amazing demonstrations of true 3D video using an off-the-shelf Microsoft Kinect.

Potential applications of Kinect
What makes Kinect an amazing device - is - the fact that it is able to capture purely 3D data ( 3D points ) at 30 FPS. Let me say it again more loudly - 30 FPS!
The original purpose of Kinect is clearly in gaming, as the device, when released, was meant to be used along with the Xbox 360. Using Kinect, a user is able to interact with / manipulate the 3D virtual world just by making suitable gestures, mostly as if he/she were physically there.
The first use-case of Kinect that comes to my mind is telepresence.

Saturday, January 8, 2011

Limitations of Ray Tracing

An article here, written by Dean Calver at the end of 2007, basically talks about the limitations of ray tracing. The good thing about this article is that it has been written from a neutral, unbiased perspective, which I find very rare.
As this article was written back in 2007-2008, it makes sense to evaluate it again, since things may have changed since then: some arguments may no longer be applicable and some new arguments may have come in.
I would love to do that soon and post my views here.

In case the above link doesn't work, use this manual link: http://www.beyond3d.com/content/articles/94/1

I consider it very important that you become an expert: gain tremendous mastery over one area in life. One single area, but killer knowledge and expertise. It's really surprising sometimes when people who come from completely different backgrounds do miraculous work and/or become really important people in some amazingly different area. The guy who used to manage the whole DirectX graphics group at NVIDIA (where I worked for a year) was a PhD in nuclear physics. Yes. Or... umm, a bright student here at IIT Delhi who is doing his PhD in algorithms had done his masters in VLSI! And these people have proved themselves to be much better than other people who may have been in the same field for much longer.
We were discussing today what the reasons behind this might be, and very important things came up. One was how people who have done their masters/PhDs in mathematics or physics become potentially good at any analytical field. They have become very strong minds, with really sharp abilities of critical thinking. They know how to attack problems and give novel kick-ass out-of-the-box solutions. And our second conclusion was "yes, *this* is it"!!! This is exactly what a bunch of good profs, the work you do, your passion for your work, and the company you keep end up doing to you. They do it to you and make you the kind of person I mentioned above.
Isn't *this* something that you should be doing your PhD, and for that matter, even your masters for?

I started from a very different note than what I ended on...I hope it made some sense to you :)

One step ahead...:)

  • It would be a really good idea if we could combine ray tracing and rasterization in an innovative way. Prof. Subodh gave an example of figuring out ambient occlusion at any point by putting a camera at that point looking out, rasterizing the scene from there, and somehow (say, by occlusion culling) finding out how much of the rendered depth is greater than some "radius" depth. By radius I mean the radius of the sphere of influence; only occluders inside it matter. Similar to AO, can we make use of rasterization within ray tracing for much more than just primary ray casting?
  • There were two projects he talked about. These two projects are being worked on in coordination with a German university, on account of which, he mentioned, there would be good chances of doing a summer internship in Germany to work on these projects. They are:
  1. Capturing and rendering fine surface details: He talked about capturing surface details from real-life photographs and using them. He was talking of achieving an effect similar to bump mapping.
  2. Microsoft Kinect device: Making novel use of it to do various things, like, one, creating point clouds or, two, extracting 3D models from Kinect captures. The device, as he said, basically captures RGB+depth info of what it "sees", -and-here-it-comes- in real-time. That makes it a really useful device.
I heard that the association is with the Max Planck institute. In his Research Mission, Prof. Dr. Marcus Magnor of the same institute, on its website, mentions: "In recent years, however, progress in rendering algorithms and graphics hardware has led to the insight that, despite faster and more complex rendering calculations, the modeling techniques traditionally employed in computer graphics often fundamentally limit attainable realism. I therefore concentrate on investigating suitable algorithms to import natural world-recorded realism into computer graphics. Such image/video-based modeling and rendering techniques employ conventionally taken photographs or video recordings of the real world to achieve photo-realistic rendering results."
These lines underline the importance of modeling and surface detailing. The same motivation is, in my opinion, behind the first project that Subodh mentioned.

Tuesday, January 4, 2011

Ray tracing in games and movies

Today I was going through the HPG 2009 report. It mentions Larry Gritz, in his keynote, saying that ray tracing makes more sense for film rendering than the current REYES technique. He also gives several reasons for it. Having read that, I was wondering how many and which games/movies actually use ray tracing for rendering as of today. Hence this post.

It seems to me that ray tracing has to defeat rasterization in the games industry and, for film rendering, has to defeat REYES. Seems like ray tracing is having a really hard time these days!
Ray tracing in games
1)
In 2007-08, students of the IGAD course at the NHTV University of Applied Sciences (Breda, The Netherlands) worked on two games using a real-time ray tracer for rendering. Here is the page that describes their work.
Though I did not find any good feedback from that place.

2) Quake 4, Quake Wars and Wolfenstein have been ray traced!
Here are the details.
Quake 3 ray traced : 2004
Quake 4 ray traced : 2006
Quake Wars ray traced : 2008
Wolfenstein ray traced : 2010

All these projects, seemingly, are CPU based.


This article at Intel's research site reports on Wolf3D getting ray traced:
  • Effects achieved : reflections, refractions,
  • Scene complexity: around 1 million triangles ( claimed )
  • Performance :
  • CPU/GPU/Hybrid:


Indirect Illumination: the thing that makes global illumination a challenge!

What are the methods/techniques that are used to render ( a.k.a. approximate :) ) indirect illumination?
  • Brute Force
  • Irradiance map
  • Light Cache
  • Metropolis light transport ( is the the same as Light cache?)
  • Photon map
  • Path tracing, bidirectional path tracing
  • VPL ( Virtual Point Lights ) (based on the instant radiosity formulation)
  • VSL ( Virtual Spherical Lights )
What are the sub-challenges?
  • Losing out on small, sharp shadows
  • Losing out on glossy inter-reflections
  • noise!
References:
Approaches to indirect illumination used in GI
http://www.spot3d.com/vray/help/150SP1/render_params_gi.htm
http://www.spot3d.com/vray/help/150SP1/examples_GI.htm

Is ray tracing a saturated field?

Look back, briefly!
Ray tracing has been studied for decades; as far as I know, since 1968 (ray casting: Appel, 1968). The first paper on recursive ray tracing came in 1980 (by Whitted). Until around 2000, ray tracing had always been considered to be significantly slower than conventional rasterization techniques. From around 1997, people started seriously believing in making ray tracing real-time. Since then, a lot of effort has been put into making it real-time, or at least as fast as rasterization if not faster. There used to be an important conference dedicated to research work in ray tracing known as the Symposium on Interactive Ray Tracing. After 2008 it was combined with another graphics-hardware-related conference into a single bigger conference known as High-Performance Graphics. The work published in HPG and, of course, SIGGRAPH and SIGGRAPH Asia will give you a more detailed idea of how and where ray tracing is going.


What makes me think that Ray Tracing still has opportunities?

OK, think of it this way. Why is it difficult to replace rasterization with ray tracing today? Let's list out the reasons. First, of course, is the dominance of rasterization-based hardware. Secondly, many problems have been solved or partially solved for rasterization. At the very least, *a huge effort* has been put into rasterization, and that has resulted in really impressive techniques and effects. All in all, these are the primary reasons.

Now let's see why ray tracing would be the future of rendering, considering one reason at a time. Hardware, it seems, is slowly but clearly moving towards ray tracing. Many NVIDIA researchers are working on ray tracing, and the whole CUDA/GPGPU direction plus OptiX tell us that NVIDIA realizes the importance of new graphics hardware being able to support high-performance ray tracing. So, things *are* changing. I don't know what AMD is doing about it, honestly. That is important, too.

The second aspect is important. A lot of techniques have already been developed and a lot of research has already been done assuming rasterization beneath. Back then, the bias towards rasterization was natural and obvious, because ray tracing was really slow. Really, really slow. Maybe that is why the graphics hardware industry also preferred rasterization. But now we see that ray tracing can be fast, actually real-time. There may be one thing to be verified here: as Prof. Subodh said, it may be that the "real-time" performance seen today is not for very complex scenes, or otherwise only for some *specific* highly complex scenes. But the point we should focus our attention on is that relatively less effort has been put into ray tracing than into rasterization, and most of the effort put into ray tracing was directed at eventually getting it real-time. Now, think about the rest of the effects. There are so many. We see that SIGGRAPH 2010 has motion blur and defocus within ray tracing. Like this, there may be many more effects that ray tracing has not yet been combined with. THIS is the opportunity. Some opportunities may be related to specific effects, e.g. motion blur requires a dynamic ADT (acceleration data structure), which I really thought was useful.


Here are examples of some attempts that were made in this direction:


What does ray-tracing lack? - Opportunities on the horizon
  • Performance : Push FPS more and more ( bandwidth, packet tracing, SIMD )
  • hybridizing RT with rasterization. To get the best of both worlds
  • dynamic data structures and related effects ( motion blur, etc )
  • Indirect illumination effects
  • how would we achieve good Caustics using ray tracing?
  • soft shadows and glossy reflections
  • Depth of Field? deFocus? Anti-aliasing?
  • micro-polygon RT, stochastic/distributed RT, monte carlo RT