Site hosted by Angelfire.com: Build your free website today!

Radix and bucket sorts

Introduction to Radix Sorts

Radix Sorts

This section will discuss different types of Radix sort algorithms.

Binary Radix Sorts

N-bit-at-a-time Radix Sorts

It is not necessary to sort on 1 bit at a time. If you are prepared to build more lists, in the division stage, you can sort an arbitrary number of bits in each pass. In this section I shall outline an 8-bit-at-a-time radix sort geared for sorting arrays of strings, which uses a Most-Significant-Byte first divide-and-conquer algorithm and a strange little trick for skipping over the "missed values" for each byte.

The Bucket Sort

A bucket sort is a cousin of the radix sort. The idea behind a bucket sort is that, sometimes, you have a pretty good idea of what some of the high level partition values ought to be. For example, if you are sorting phone numbers, you sort by dividing the input array into "buckets" according to the value of the first digit of the phone number (assuming that 0 through 9 are equally likely). A better example, though, is surnames. When I was in Highschool they used to divide the students into "houses" based on surname range. If your surname was between A(<=) and (<)DB, you were in one house, if it was between DB(<=) and (<)G you were in another, if it was between G(<=) and (<)K you were in a third, and if it started with K (or a later letter in the alphabet), you were in the fourth.
This sort has a lot of names. I've also heard it called the Post Box sort. Some versions use address calculation (this works well if you've a reasonably uniform distribution in a key). For the example I gave before with 0,1,2,3,4,5,6,7, an address calculation sort (or "distribution counting" sort) will give the answer faster than anything else. Full stop.

An example bucket sort implementation

...Just as soon as I write one. The only one I ever wrote was in C++, back in 1997, and I've lost the source code!

The Unique Value Sort

Unique Value Sorts are best used when there are a lot of duplicates in the input data. Sometimes Unique Value Sorts are confused with Bucket sorts in the literature. But the Bucket sort was originally developed for sorting by value range, and I want to distinguish it from sorting algorithms which assume there are only a few different values in the input, and compare each record, not with ranges, but with exact values. I've written a couple of Unique Value Sorts but they're not very efficient for dealing with more than about ten distinct values (they use Straight Insertion on an array of the different values found so far, whenever they encounter a new value), and I haven't put in anything yet for them to "downgrade" to a more appropriate sorting algorithm if and when it becomes apparent that the input has too much variety (this means that my versions will take time proportional to the square of the number of input records, if all the records are distinct, because they will effectively be running a Straight Insertion sort on all of the records).
The algorithm I've been using reads like this:
  1. Create an auxiliary table,V, that is to list the distinct values found in the input, sorted by the order they should appear in the output.
  2. Create another auxiliary table,B, that, for each different value, indicates in what order it was found (example, if you read "a","c","a","b","c","b","a" you'd have "a" mapped to 0, "b" mapped to 2, "c" mapped to 1).
  3. Create a third auxiliary table,C, that indicates how many times each distinct value has been found.
  4. Create a fourth auxiliary table,L, that is to hold the "order my value was found" for each element in the input array.
  5. Copy the value of the first record into V, set the first record in B to 0, the first record in C to 1, and the first record in L to 0.
  6. For each succeeding record, binary search the table of values found so far to see if its value has already been encountered
  7. The fiddly bit. We can now reorder the input array using the information in V,B,C and L. For the example I gave above, they would be:
     
    Vabc Values found   We have a problem! We've got the order that the different
    values were found, and how many of each there were.
    But we want the output in value order,
    not in the order each value was found in the unsorted input.
    Fortunately there's a trick that will help us.
    B021 Order they were found
    C322 How many found
    L010 2120 Order in which each input
    element's value was found
     
  8. If we find the inverse permutation of B, we will know, for each value (in terms of order of appearance), where it belongs in the output order (in this case B-1 is the same as B, but usually it wont be).
     
    B-1021 Order in which they belong
     
  9. From B-1 and C we can calculate where the first record with each distinct value ought to go. We build an array that gives the location (in terms of the appearance order numbers used in L), by going through C in the order given by B-1, and recording the totals of the counts (from C) looked at so far - not including the current one - to build X: 
     
    X053 Where the next one goes

  10. Don't worry! We're nearly done. Now we go through L, and look up the location for the next element in X. Each time we do this we bump up the value in the element of X by 1. I'll illustrate for the each record (with A the output array that indicates the final order, in terms of the original order).
     
    After that, A maps rows of the output back to their original indices in the input, like this... [To do: insert IFRAME/LAYER with graph]
It's possible to build much simpler Unique Value sorts that, say, are given a list of the values they can expect in the data. The algorithm I've outlined above does not need a list of values; it finds them in the input data. But it will run like a dog if there are more than, say, 10 different values in the input data. I should really be using a balanced binary tree to store the values found so far! Then it would scale rather better. But for that I'll need to know how to build and maintain balanced binary trees! And I haven't read that part of Knuth yet! More's the pity.