Wednesday, January 26, 2011

My first parallel program: Bucket sort

Well, let's get to the point. How will we parallelize bucket sort?
What's the scope? Figuring out *how much* of it can be parallelized
Radix sort, as we know, is basically bucket sort applied to every digit, from right to left, i.e. least significant digit first. Broadly, can we parallelize the processing of individual digits? That is, given 4-digit integers, can we have 4 threads that independently sort on the four digits? No, since the whole idea of radix sort depends on sorting the digits sequentially. That is, we must sort all numbers on the least significant digit first, then sort the resulting array on the next-least-significant digit, take the resulting array, sort it on the next digit, and so on.
So, this part has to be serial.

However, in the inner bucket sort, the numbers can be inserted into their respective buckets completely independently. Great! But there is just one problem. Say there are a million numbers and we use the classic approach of ten buckets. Then there are going to be numerous collisions on any given bucket. So can we somehow get around this?

Idea I
If we have only 10 buckets and there are thousands of threads, then there are bound to be plenty of collisions. One way to avoid these collisions is to somehow increase the number of buckets. How can we increase the number of buckets?
We have 10 buckets because we examine one digit of each number. How many values can a single digit take? Ten. Hence we made 10 buckets. Now, say we examine two digits at a time. Two digits can take values ranging from 00 to 99, i.e. a hundred values. That means a hundred buckets. So we'll make 100 buckets, and instead of examining one digit at a time, we examine two. The rest remains the same, e.g. considering the rightmost (least significant) pair of digits in the first level, going towards the leftmost end level by level, and so on.
So, do you get the essence of what we did? We increased the number of buckets, by examining more digits at a time, to reduce the number of collisions on any bucket. If we take this approach further, we are basically heading towards pure counting sort. Remember counting sort? In counting sort, say no number is more than 6700; then we create 6700 buckets, and every number i goes into its own bucket, namely the i'th bucket. There will be zero collisions when there are no/few repetitions. Zero collisions! What you pay to get this is lots of memory.
So, from a bird's-eye view, we see two extremes.
One is pure bucket sort, where we examine 1 digit at a time and have 10 buckets, implying a minimum memory footprint but high collision.
Two is pure counting sort, where we examine all d digits at a time and have N_MAX buckets, meaning a huge memory footprint but zero collisions.
Clearly, it makes us think that the best way is the middle way. Can we come up with an adaptive heuristic that tells us, "dude, if you examine p digits at a time, you'll get the best possible balance of memory footprint and collisions"?

Someone has already spent some time on this!
So obviously I googled. I found this pdf, which is basically a project report by Nie Weiran from 2008, dedicated to parallelizing radix sort. What else would I want! :)
After I go through the paper, which is no more than 3-4 pages, I will post the essence and, of course, my thoughts.

A word on performance measurement
I spent almost 2 hours searching and reading about performance measurement tools for a Linux multi-core environment. There are seemingly good tools out there, the important ones of which have been surveyed here.
For now, I didn't need a very hi-fi tool; I just needed to see the real-time utilization of my CPU cores (the machine I am using is a quad-core Intel i5). For this primary purpose, I found "gnome-system-monitor" really good. My system already had it installed; maybe it comes by default.

Progress  - I

Well, I found some (very, very few :)) implementations of bucket sort. Two of them were OpenMP based.

A note about the third one
It uses a library called MCSTL. It can be downloaded from here. Its corresponding research paper can be downloaded from here. Interestingly, my version of Ubuntu, viz. 10.10 or Maverick, does not have support for MCSTL. Even when I downloaded MCSTL and copy-pasted the .h files into my "include" folder ( /usr/include ), it did not work somehow. Apparently, there is no proper way to install MCSTL, so I had to do this ugly copy-paste. Bottom line: it did not work. Later I found that MCSTL requires g++-4.2.2, which Maverick doesn't support, so the conclusion is: forget MCSTL.
Fortunately, I found an interesting piece of information on the STXXL library page. It says STXXL may internally use MCSTL or the libstdc++ parallel mode, with the latter being the successor of MCSTL. So, that's what I need: the libstdc++ parallel mode library. This page is a decent piece of documentation about it. Being part of libstdc++, it is very well documented in many places.
I got some documentation of MCSTL here.

Progress II

Finally, after spending a few more hours on which parallel radix sort algorithm to implement, I have come to the conclusion that I am going to implement Partitioned Parallel Radix Sort. I have gone through the algorithm and started with some basic implementation. One issue that might have to be faced later is that the paper is meant for distributed memory architectures, whereas OpenMP is meant for shared memory architectures.

Miscellaneous interesting Pages

1. Parallel In-Place Radix Sort Simplified
2. 32 OpenMP Traps For C++ Developers
3. Histogram Sort (to find the sizes of the buckets beforehand so that we won't waste space by overestimating the bucket size)
4. A better idea of histogram sorting

1 comment:

  1. The world's fastest (parallel) radix sort:

    The paper:

    Before going through the above paper (which might be too complex) and implementation, I would recommend going through this: