Sort Overview

Click here to hide the discussion of programming techniques

Pipelines

None of the early articles will discuss fancy wrinkles for sorting algorithms, such as "descending" sort orders, or case-insensitive comparison. This is because I shall be layering "order reversal" and "case removal" routines over the sort algorithms, in higher-level routines. The problems and solutions of descending order, case insensitivity, and following the rules for NULL comparison do NOT have to be addressed in the early articles, as they can be solved in the same ways for all basic sort algorithms, by using pipelines. This will be discussed in more detail in article #4, which discusses hybrid and adaptive sorts (where the sort algorithm can be chosen by the caller, or chosen - based on a scan of some of the data - by the sorting script). A typical sorting pipeline might look like this:

Removing Nulls

Other Type Conversions

Converting all strings to upper case

Randomizing input order

Finding asending and descending runs

Bucket Sort

Radix Sort

Quick Sort

Heap Sort

Straight Merge Sort

Natural Merge Sort

Straight Insertion Sort

Bubble Sort

Shell Sort

Making an unstable sort order stable

Reversing the output order

Restoring Nulls and moving them to front or back

Changing the output order of the actual array

Restoring the cases of strings, Other backwards type conversions

While that looks like a lot of steps (and, yes, it is!), all of the pipeline elements except for the sorts themselves will run in time proportional to the number of records. So, for example, if you are sorting text database records that may be null, and you want the sort to ignore case, and display in reverse alphabetical order, you'd:

Make a temporary copy of the column that you wish to sort
Replace every null with a blank string (but record where the Nulls were found!).
Replace every string with a version of itself converted to upper case
Call your sorting routine
Call a different routine that reverses the sort order
Use your knowledge of where the nulls were to sort those records to the top or bottom
...Finally...reorder your original array

Or, more to the point, you'd instruct the sorting program to do all those extra steps for you. And your sorting routine would be able to compare strings like this:

  If a <= b Then 
    x
  ElseIf a = b Then
    y 
  Else 'a>b
    z
  End If

rather than this (which just about makes me lose my lunch - though this is pretty much the way I've been coding my sort routines too, until recently...):

  If IsNull(A) Then
    If IsNull(B) Then
      y
    Else
      z
    End If
  ElseIf IsNull(B) Then
      x
  ElseIf ucase(a) < ucase(b) Then
      x
  ElseIf ucase(a) = ucase(b) Then
      y
  Else
      z
  End If
  '
  'And I left out support for "reverse order sorts"!
  'That'd make it even uglier!
  '

Remember that most sort algorithms have to compare each record with more than one other record, most of the time! So handling the nulls and the upper case and reversing the sort order, etc, outside the sorting routine will make it faster as well as simpler.

Indices

The sorting algorithm implementations given in this series of articles will rarely work directly on the information that is to be sorted. They will almost always act on it through indices. This is not only for performance reasons (although that is a factor, particularly when sorting arrays with a large number of columns). In addition, by swapping indices rather than actual array elements, you can:

Sort non-contiguous subsets of arrays (this doesn't sound important, but just wait until you see article#7!, once I write it).
Pass the sorting routines a subset of only the fields you wish to sort on, or even calculated fields (as I recomend that you implement Null comparison semantics and case insensitivity using calculated fields in a "pre-sort pass" this becomes important. See article article#6, once I write it).
More easily share common steps (and reuse the code) between functions or libraries which are operating on data stored in different formats, or using different sorting algorithms. For example, all descending order sorts are implemented via a "reverse orderer" that runs after the sort itself. See article#6, once I write it. See also the comments on pipelines above.
More easily optimize the code that, in the end, actually reorders the data in the output array. See article#10, once I write it.
There are some other (weird!) advantages to using arrays of indices, but to exploit them requires a lot of permutation theory and I won't be discussing them in the early articles. But, basically, if you've got a sort order on A and another on B you can combine them relatively quickly

Again, there's a downside. Sorting indices rather than the array elements themselves roughly doubles the execution time of a sorting algorithm. However, in some cases you can have your cake and eat it too. If most of the values in the sort key are duplicated many times, it is possible to sort the array elements themselves, and then reverse-engineer the index order from the sorted array order. This sounds weird, I know, but here's how it works:

Keep a copy of the unsorted data (call this dUnsorted)
Sort the data array itself (this is twice as fast as sorting indices into it).
Build a table of the different values found in the sorted data array, in order, storing the count of times each value was found.
Run through each record of dUnsorted, running from left to right, and find its value in the table. From its position in the table you can determine where its index belongs in the output index sort order.

For example, say we're sorting animal records on species name (and we know there are a lot of duplicates for each species name), but we want to save time by shuffling neither the records NOR their indices themselves. We build a copy of the values in the name column, and work from that.

given the input data array	Array("cat","bear","cat","antelope","cat","wolf","bear","dog"),
and the sorted data array	Array("antelope","bear","bear","cat","cat","cat","dog","wolf")
we know the different sorted values are	"antelope" - 1 copy, "bear" - 2 copies,"cat" - three copies,"dog" - one copy,"wolf" - one copy)
so we know, for the first element with any given value in the unsorted data, where it will go:	"antelope" - index 0 , "bear" - index 1,"cat" - index 3,"dog" - index 6,"wolf" - index 7
we can now run through the unsorted array figuring out where each element belongs	The first record in the input array was "cat", so it is 4th in the output index order (belongs at index 3)(because we figured out in the last step how many records were in front of it). We bump up the pointer indicating where in the output order the next value of "cat" should be placed, and keep going.
Eventually... we get the correct index sort order	Array(3,1,6,0,2,4,5,7) which says, for example, that the index of the element that should come first was originally 3 - the record that had a species name of "antelope", and so on.

Use of 1D Arrays

Actual implementations of sort algorithms will work with 1D arrays. Even where multiple columns of a 2D array (or of an array of arrays) are being used as a sort key, the grunt-work will be done with 1D arrays. The reasons for this are:

It improves performance.
Arrays that sort 1D arrays are easier to read (and easier to get right!)
It makes it easier to use mix and match sorting routines, using different routines for different parts of a multi-column key (as outlined in Article#4)
It reduces the amount of information that has to be passed from place to place. This matters in Active EXE servers, but with careful interface design is less important for VBScript/ASP and ActiveX DLLs; and, lastly, don't forget:
You can get around issues like Nulls and case-insensitive comparison (or other conversions) by removing nulls, or converting all the strings to upper case, before calling the sort routine (so you only have to do them once for each record, not every time you compare it to another).
It improves performance.

Unfortunately sorting with indices has a downside:

You have to set aside additional memory to hold the indices
If you are sorting only a single column, which contains values of a basic datatype, such as Integer or Long (or to a lesser extent, Double or Date), more time is spent looking up indices than is saved by not swapping the array elements themselves.

In later articles I'll cover some bent ways to do away with the indices during most sorting. Hint: if you are sorting strings, and you are sorting a 1D array copy of the column you can usually tack on the index temporarily with this:

    d(i) = d(i) & (chr(0) & i)

Then you strip it back off after the sort is complete...

    'd=copy of data, a=array to take indices in sort order
    for i = lbound(d) to ubound(d)
      packed = d(i)
      a(i) = val(mid(packed, instr(1,packed,chr(0))+1))
      'if d weren't just a copy of the original data, you'd also have:
      'd(i) = left(packed, instr(1,packed,chr(0))-1)
    next

Use of types

The routines quoted in these sort articles are written in VB, but not in the VBScript dialect used in ASP. There are two reasons for this:

the routines use the "As" keyword, so that specific datatypes can be used. When indexing into an array, for example, performance is much better if the variable used for the index is a Long, rather than a Variant.
Later some of these routines will be used as templates for versions which are optimized for particular types (for example, for versions of sort routines intended specifically to sort strings, and nothing else.

Despite this, I use Variant more often than Long. Indeed, where I intend to use arrays of Long, I declare variables of type Variant rather than Long(). Here's why:

It's harder to tell if a Long() array is initialized without raising errors. With a Variant you can just check IsEmpty(x) to see if x has been initialized.
How on earth do you pass a default value for an optional Long() parameter? If you know a way, tell me!
There's no performance benefit to be had by using Long(). If anything your code will run slightly slower(!) - by about 2%.

Performance Recording

I didn't use profilers to get timing for any of the sort routines I'll be discussing in these articles. There are a couple of different methods used for performance recording and measurement.

Calls to a timing class are made. I've a VB class that can return dates accurate to about ten microseconds and times accurate to slightly less than a microsecond (at least, on a reasonably fast workstation running Windows NT). The high resolution performance counter is used, not the clunky multimedia timer (which is rather less accurate). Surprisingly, code in the VB IDE runs almost as fast as it does when it is compiled as a standalone EXE (unless you disable array bound checking. More on this later).

To get an idea of how many times each of the basic operations takes place in a given sort run, I build debug versions of the sorting routines - at least in VB, anyway - which include calls to routines such as these:

Routine Name	Comment
CountComparison	Making a comparison between two record values
CountSwap	Exchanging two records or indicies. This generally takes three moves.
CountMove	Copying a record or index from one place to another
CountCall	A procedure call
CountStack	A single stack push or pop
CountPartition	In partitioning sorts (such as Quicksort or MSB-first radix sort) it is sometimes worth recording which partitions are sorted.

When fiddling about with variations on sorting algorithms, it can be quite interesting to see which counts go up and which go down (and how overall performance changes, too). The results can be quite surprising!

Sorting Overview

Introduction

Which is fastest?

Programming Techniques

Pipelines

Indices

Use of 1D Arrays

Use of types

Performance Recording

The Articles