Click here to hide the
discussion of programming techniques
Pipelines
None of the early articles will discuss fancy wrinkles
for sorting algorithms, such as "descending" sort orders, or
case-insensitive comparison. This is because I shall be layering
"order reversal" and "case removal" routines
over the
sort algorithms, in
higher-level routines. The problems
and solutions of descending order, case insensitivity, and
following the rules for NULL comparison do NOT have to be addressed
in the early articles, as they can be solved in the
same ways for
all basic
sort algorithms, by using pipelines. This will be discussed in
more detail in
article #4, which discusses hybrid and adaptive sorts
(where the sort algorithm can be chosen by the caller, or chosen -
based on a scan of some of the data - by the sorting script). A typical
sorting pipeline might look like this:
| Removing Nulls |
| Other Type Conversions |
| Converting all strings to upper case |
| Randomizing input order |
Finding asending and descending runs |
| Bucket Sort |
Radix Sort |
Quick Sort |
Heap Sort |
Straight Merge Sort |
Natural Merge Sort |
Straight Insertion Sort |
Bubble Sort |
Shell Sort |
|
| Making an unstable sort order stable |
| Reversing the output order |
| Restoring Nulls and moving them to front or back |
| Changing the output order of the actual array |
| Restoring the cases of strings,
Other backwards type conversions |
While that looks like a lot of steps (and, yes, it is!), all of the pipeline elements
except for the sorts themselves will run in time proportional to the number of records.
So, for example, if you are sorting text database records that may be null, and you want
the sort to ignore case, and display in reverse alphabetical order, you'd:
- Make a temporary copy of the column that you wish to sort
- Replace every null with a blank string (but record where the Nulls were found!).
- Replace every string with a version of itself converted to upper case
- Call your sorting routine
- Call a different routine that reverses the sort order
- Use your knowledge of where the nulls were to sort those records to the
top or bottom
- ...Finally...reorder your original array
Or, more to the point, you'd instruct the sorting program to do all those extra steps
for you. And your sorting routine would be able to compare strings like this:
If a <= b Then
x
ElseIf a = b Then
y
Else 'a>b
z
End If
rather than this (which just about makes me lose my lunch - though this is pretty
much the way I've been coding my sort routines too, until recently...):
If IsNull(A) Then
If IsNull(B) Then
y
Else
z
End If
ElseIf IsNull(B) Then
x
ElseIf ucase(a) < ucase(b) Then
x
ElseIf ucase(a) = ucase(b) Then
y
Else
z
End If
'
'And I left out support for "reverse order sorts"!
'That'd make it even uglier!
'
Remember that most sort algorithms have to compare each record with more than
one other record, most of the time! So handling the nulls and the upper case
and reversing the sort order, etc,
outside the sorting routine will make
it faster as well as simpler.
Indices
The sorting algorithm implementations given in this series of
articles will
rarely work directly on the information that
is to be sorted. They will
almost always act on it through
indices. This is not only for performance reasons (although that
is a factor, particularly when sorting arrays with
a large number of columns). In addition, by swapping indices rather
than actual array elements, you can:
- Sort non-contiguous subsets of arrays (this doesn't sound
important, but just wait until you see article#7!, once I write it).
- Pass the sorting routines a subset of only the fields you
wish to sort on, or even calculated fields (as I
recomend that you implement Null comparison semantics and case insensitivity using
calculated fields in a "pre-sort pass"
this becomes important. See article article#6, once I write it).
- More easily share common steps (and reuse the code)
between functions or libraries which are operating on
data stored in different formats, or using different sorting
algorithms. For example, all descending
order sorts are implemented via a "reverse orderer" that runs after
the sort itself. See article#6, once I write it.
See also the comments on pipelines above.
- More easily optimize the code that, in the end,
actually reorders the data in the output array.
See article#10, once I write it.
- There are some other (weird!) advantages to using arrays of indices, but
to exploit them requires a lot of permutation theory and I won't be discussing
them in the early articles. But, basically, if you've got a sort order on A
and another on B you can combine them relatively quickly
Again, there's a downside. Sorting indices rather than the array elements themselves
roughly doubles the execution time of a sorting algorithm. However, in some cases you
can have your cake and eat it too. If most of the values in the sort key are
duplicated many times, it is possible to sort the array elements themselves, and
then
reverse-engineer the index order from the sorted array order. This
sounds weird, I know, but here's how it works:
- Keep a copy of the unsorted data (call this dUnsorted)
- Sort the data array itself (this is twice as fast as sorting indices into it).
- Build a table of the different values found in the sorted data array,
in order, storing the count of times each value was found.
- Run through each record of dUnsorted, running from left to right, and find
its value in the table. From its position in the table you can determine
where its index belongs in the output index sort order.
For example, say we're sorting animal records on species name (and we know there are
a lot of duplicates for each species name), but we want to save
time by shuffling neither the records NOR their indices themselves. We build a copy
of the values in the name column, and work from that.
| given the input data array |
Array("cat","bear","cat","antelope","cat","wolf","bear","dog"), |
| and the sorted data array |
Array("antelope","bear","bear","cat","cat","cat","dog","wolf") |
| we know the different sorted values are |
"antelope" - 1 copy, "bear" - 2 copies,"cat" - three copies,"dog" - one copy,"wolf" - one copy) |
| so we know, for the first element with any given value in the unsorted data, where it will go: |
"antelope" - index 0 , "bear" - index 1,"cat" - index 3,"dog" - index 6,"wolf" - index 7 |
| we can now run through the unsorted array figuring out where each element belongs |
The first record in the input array was "cat", so it is 4th in the output index order
(belongs at index 3)(because we figured out in the last step how many records were in front of it).
We bump up the pointer indicating where in the output order the next value of "cat" should
be placed, and keep going. |
| Eventually... we get the correct index sort order |
Array(3,1,6,0,2,4,5,7) which says, for example, that the index of the element that should come first
was originally 3 - the record that had a species name of "antelope", and so on. |
Use of 1D Arrays
Actual implementations of sort algorithms will work with 1D arrays.
Even where multiple columns of a 2D array (or of an array of arrays)
are being used as a sort key, the grunt-work will be done with
1D arrays. The reasons for this are:
- It improves performance.
- Arrays that sort 1D arrays are easier to read (and easier
to get right!)
- It makes it easier to use mix and match sorting routines,
using different routines for different parts of a multi-column
key (as outlined in Article#4)
- It reduces the amount of information that has to be passed from
place to place. This matters in Active EXE servers, but with careful
interface design is less important for VBScript/ASP and ActiveX DLLs;
and, lastly, don't forget:
- You can get around issues like Nulls and case-insensitive comparison
(or other conversions) by removing nulls, or converting all the strings
to upper case, before calling the sort routine (so you only have
to do them once for each record, not every time you compare it
to another).
- It improves performance.
Unfortunately sorting with indices has a downside:
- You have to set aside additional memory to hold the indices
- If you are sorting only a single column, which contains values of a
basic datatype, such as Integer or Long (or to a lesser extent, Double
or Date), more time is spent looking up indices than is saved by
not swapping the array elements themselves.
In later articles I'll cover some
bent ways to do away with
the indices during most sorting. Hint: if you are sorting strings,
and you are sorting a 1D array
copy of the column
you can usually tack on the index temporarily with this:
d(i) = d(i) & (chr(0) & i)
Then you strip it back off after the
sort is complete...
'd=copy of data, a=array to take indices in sort order
for i = lbound(d) to ubound(d)
packed = d(i)
a(i) = val(mid(packed, instr(1,packed,chr(0))+1))
'if d weren't just a copy of the original data, you'd also have:
'd(i) = left(packed, instr(1,packed,chr(0))-1)
next
Use of types
The routines quoted in these sort articles are written in VB,
but
not in the VBScript dialect used in ASP. There are
two reasons for this:
- the routines use the "As" keyword, so that
specific datatypes can be used. When indexing into an array,
for example, performance is much better if the variable used for
the index is a Long, rather than a Variant.
- Later some of these routines will be used as
templates for versions which are optimized for particular
types (for example, for versions of sort routines intended
specifically to sort strings, and nothing
else.
Despite this, I use Variant more often than Long. Indeed, where I intend
to use arrays of Long, I declare variables of type Variant rather than Long().
Here's why:
- It's harder to tell if a Long() array is initialized without raising errors.
With a Variant you can just check IsEmpty(x) to see if x has been
initialized.
- How on earth do you pass a default value for an optional Long()
parameter? If you know a way, tell me!
- There's no performance benefit to be had by using Long(). If anything
your code will run slightly slower(!) - by about 2%.
Performance Recording
I didn't use profilers to get timing for any of the sort routines
I'll be discussing in these articles. There are a couple of different
methods used for performance recording and measurement.
- Calls to a timing class are made. I've a VB class that
can return dates accurate to about ten microseconds and
times accurate to slightly less than a microsecond (at least,
on a reasonably fast workstation running Windows NT).
The high resolution performance counter is used, not
the clunky multimedia timer (which is rather less accurate).
Surprisingly, code in the VB IDE runs almost as fast as it
does when it is compiled as a standalone EXE (unless you
disable array bound checking. More on this later).
- To get an idea of how many times each of the basic operations
takes place in a given sort run, I build debug versions of
the sorting routines - at least in VB, anyway - which include
calls to routines such as these:
| Routine Name |
Comment |
| CountComparison | Making a comparison
between two record values |
| CountSwap | Exchanging two records or
indicies. This generally takes three moves. |
| CountMove | Copying a record or index from
one place to another |
| CountCall | A procedure call |
| CountStack | A single stack push or pop |
| CountPartition | In partitioning sorts
(such as Quicksort or MSB-first radix sort) it is
sometimes worth recording which partitions are sorted. |
When fiddling about with variations on sorting algorithms,
it can be quite interesting to see which counts go up and which
go down (and how overall performance changes, too).
The results can be quite surprising!