Site hosted by Angelfire.com: Build your free website today!

Radix and bucket sorts

Introduction to Radix Sorts

Radix sorting is a radical alternative to "insertion" and "merging" sorts. Its most important - and most counterintuitive - feature is that it does not hinge on comparisons between records. In some radix sort implementations there are no comparisons between records at all.
Radix sorting is a cousin of the Quicksort. The data rows are partitioned. However, rather than choosing an arbitrary element as the pivot value, a radix sort algorithm chooses values and then partitions the array.
There are two different ways to conduct radix sorting:
1. You may sort on the most significant part of the key *first*, dividing the data into smaller arrays, which are then sorted in turn. This an MSB radix sort
2. An alternative is to sort all of the data by the least significant part of the key, then again, with the next least significant, and so on, until the entire key has been used. For example, if you were sorting the numbers 0-7, you would first sort for odd and even, then sort into two sublists based on (n And 2) (0,1,4,5) in the left, then again based on (n and 4) (0,1,2,3) in the left hand sublist. Surprisingly, after these three sorting passes, the rows would be ordered by the column.
Radix sorting is fantastic at sorting huge volumes of data. When the number of records, N, gets big enough, and there are plenty of values to sort, it's great. It was originally invented for the U.S. Bureau of the Census, and its one of the few sort algorithms that is easy to implement mechanically. The U.S. Census did it with punched cards!
Radix sorting works well with heavily duplicated data (for example, "Sex" columns in tables of data about customers or clients). It also performs well when there are relatively few different possible values for the column being sorted.
When sorting large numbers of records by a column with only small number of tightly grouped values, radix sort performs much better than Quicksort. To sort on a "sex" column, for example, where only two values are allowed, it gets the right answer in N comparisons. For a column where the values 0,1,2,3,4,5,6,7 are allowed, it takes 3N comparisons. This is much better than Quicksort's N*LOG(N) average time. For a table with a million records (in random order) to be sorted by a column with the values 0,1,2,3,4,5,6,7 (with each value equally likely) a Radix sort is almost certain to run faster than Quicksort. Probably much faster, because when Quicksort picks its first partition element it's as likely to pick 0 or 7 as 4. The general rule is that a Radix sort takes bN comparisons, where b is the number of bits in the key.
...which is its problem, too. If, say, your data only has seven values, but they are 1,2,4,8,16,32 and 64, a one-bit-at-a-time Radix sort would have to read through the entire file seven times to sort them. A properly tuned Quicksort might well get there by comparing each record with on average, about four others. This is why Radix sorts don't do well on strings, for example, which tend to carry a lot of "Junk DNA" along with the bits that matter. After all, it's not until you've looked at 42 bits (in ASCII, more than twice that for Unicode) that you can tell apart probe and problem.
However, radix sort is good at sorting random numbers (particularly if they're just as random in each bit). And it can be good at finding you the records that come first... first (if it sorts by most significant bit first).
Later in this article I'll discuss how a variety of Radix sorting *can* be used even for sorting free text strings (though it's really only suitable when you've a hell of a lot of records to sort), and we'll also discuss methods for
1. compacting keys so that they have fewer bits
2. sorting more bits at a time without too much overhead

Radix Sorts

This section will discuss different types of Radix sort algorithms.

Binary Radix Sorts

N-bit-at-a-time Radix Sorts

It is not necessary to sort on 1 bit at a time. If you are prepared to build more lists, in the division stage, you can sort an arbitrary number of bits in each pass. In this section I shall outline an 8-bit-at-a-time radix sort geared for sorting arrays of strings, which uses a Most-Significant-Byte first divide-and-conquer algorithm and a strange little trick for skipping over the "missed values" for each byte.

The Bucket Sort

A bucket sort is a cousin of the radix sort. The idea behind a bucket sort is that, sometimes, you have a pretty good idea of what some of the high level partition values ought to be. For example, if you are sorting phone numbers, you sort by dividing the input array into "buckets" according to the value of the first digit of the phone number (assuming that 0 through 9 are equally likely). A better example, though, is surnames. When I was in Highschool they used to divide the students into "houses" based on surname range. If your surname was between A(<=) and (<)DB, you were in one house, if it was between DB(<=) and (<)G you were in another, if it was between G(<=) and (<)K you were in a third, and if it started with K (or a later letter in the alphabet), you were in the fourth.
This sort has a lot of names. I've also heard it called the Post Box sort. Some versions use address calculation (this works well if you've a reasonably uniform distribution in a key). For the example I gave before with 0,1,2,3,4,5,6,7, an address calculation sort (or "distribution counting" sort) will give the answer faster than anything else. Full stop.

An example bucket sort implementation

...Just as soon as I write one. The only one I ever wrote was in C++, back in 1997, and I've lost the source code!

The Unique Value Sort

Unique Value Sorts are best used when there are a lot of duplicates in the input data. Sometimes Unique Value Sorts are confused with Bucket sorts in the literature. But the Bucket sort was originally developed for sorting by value range, and I want to distinguish it from sorting algorithms which assume there are only a few different values in the input, and compare each record, not with ranges, but with exact values. I've written a couple of Unique Value Sorts but they're not very efficient for dealing with more than about ten distinct values (they use Straight Insertion on an array of the different values found so far, whenever they encounter a new value), and I haven't put in anything yet for them to "downgrade" to a more appropriate sorting algorithm if and when it becomes apparent that the input has too much variety (this means that my versions will take time proportional to the square of the number of input records, if all the records are distinct, because they will effectively be running a Straight Insertion sort on all of the records).
The algorithm I've been using reads like this:

Create an auxiliary table,V, that is to list the distinct values found in the input, sorted by the order they should appear in the output.
Create another auxiliary table,B, that, for each different value, indicates in what order it was found (example, if you read "a","c","a","b","c","b","a" you'd have "a" mapped to 0, "b" mapped to 2, "c" mapped to 1).
Create a third auxiliary table,C, that indicates how many times each distinct value has been found.
Create a fourth auxiliary table,L, that is to hold the "order my value was found" for each element in the input array.
Copy the value of the first record into V, set the first record in B to 0, the first record in C to 1, and the first record in L to 0.
For each succeeding record, binary search the table of values found so far to see if its value has already been encountered

If the value is there, bump up its count in table C by one and look up its original arrival order in B. Write that into the same record in L
If the value is not there, make room for it in the auxiliary table, V, by moving records down. Likewise, make room in B, C, V and L (by doing the same moves) (this step is why if there are lots of distinct values, this approach behaves like an Insertion Sort. It is doing an Insertion Sort, but not on the input data, rather on the different values found in the input data. Write the current size of V into the new record in L, and write the value itself into the new record in V.

The fiddly bit. We can now reorder the input array using the information in V,B,C and L. For the example I gave above, they would be:

V	a	b	c	Values found					We have a problem! We've got the order that the different values were found, and how many of each there were. But we want the output in value order, not in the order each value was found in the unsorted input. Fortunately there's a trick that will help us.
B	0	2	1	Order they were found
C	3	2	2	How many found

L	0	1	0	2	1	2	0	Order in which each input element's value was found

If we find the inverse permutation of B, we will know, for each value (in terms of order of appearance), where it belongs in the output order (in this case B^-1 is the same as B, but usually it wont be).

B^-1 0 2 1 Order in which they belong
From B^-1 and C we can calculate where the first record with each distinct value ought to go. We build an array that gives the location (in terms of the appearance order numbers used in L), by going through C in the order given by B^-1, and recording the totals of the counts (from C) looked at so far - not including the current one - to build X:

X 0 5 3 Where the next one goes
Don't worry! We're nearly done. Now we go through L, and look up the location for the next element in X. Each time we do this we bump up the value in the element of X by 1. I'll illustrate for the each record (with A the output array that indicates the final order, in terms of the original order).
- Element 0 of L is 0. X(0) is 0. So A(0) is 0. Set X(0) to 1.
- Element 1 of L is 1. X(1) is 5. So A(5) is 1. Set X(1) to 6.
- Element 2 of L is 0. X(0) is 1. So A(1) is 2. Set X(0) to 2.
- Element 3 of L is 2. X(2) is 3. So A(3) is 3. Set X(2) to 4.
- Element 4 of L is 1. X(1) is 6. So A(6) is 4. Set X(1) to 7.
- Element 5 of L is 2. X(2) is 4. So A(4) is 5. Set X(2) to 5.
- Element 6 of L is 0. X(0) is 2. So A(2) is 6. Set X(0) to 3.
After that, A maps rows of the output back to their original indices in the input, like this... [To do: insert IFRAME/LAYER with graph]

B^-1	0	2	1	Order in which they belong

X	0	5	3	Where the next one goes

It's possible to build much simpler Unique Value sorts that, say, are given a list of the values they can expect in the data. The algorithm I've outlined above does not need a list of values; it finds them in the input data. But it will run like a dog if there are more than, say, 10 different values in the input data. I should really be using a balanced binary tree to store the values found so far! Then it would scale rather better. But for that I'll need to know how to build and maintain balanced binary trees! And I haven't read that part of Knuth yet! More's the pity.