Merge Sorts

Introduction

Quicksort and Bubblesort work by determining the correct position of individual elements. But there are other ways to sort.
One of the alternatives is to sort by merging. A merge is an operation carried out on two (or more) lists. The idea is, given two lists that are known to be sorted, we can determine a merged list, which contains all of the elements from the two input lists, in the correct order.
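As a rough sketch of that merge operation (written in Python purely for illustration; the function name `merge_sorted` is made up here and is not taken from any code in this article), it might look like this:

```python
def merge_sorted(left, right):
    """Merge two lists that are already sorted into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Taking from `left` on ties keeps equal elements in their input order.
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One input has run out; the rest of the other is copied with no comparisons.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


print(merge_sorted([1, 4, 9], [2, 3, 10, 11]))  # [1, 2, 3, 4, 9, 10, 11]
```

Note that the output list is new storage, which is where the extra memory used by merge sorts comes from.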
We shall be examining two different sort algorithms that work by merging lists. First, we shall cover the Natural Merge, which attempts to exploit existing ascending and/or descending runs in the input data. Then, we shall cover the 2-Way Straight Merge, which does not make use of ascending or descending runs.
Some interesting properties of merge sort algorithms are:
1. They require more memory than insertion and positional sorting algorithms
2. On average, they require fewer comparisons, but more row swaps
3. Unlike many of the faster positional sorting algorithms, they are keyed.
4. Their worst case is usually not much worse than their best case. Merge sorts (particularly the straight merge) are good when you don't want to worry about the characteristics of the input array.
5. It is relatively easy to develop "co-routines" that can exploit mergesort techniques on incoming data, without knowing how much there will be (it will be a long time before I discuss sorting co-routines, unfortunately. I haven't even decided on an article number yet!).

Natural Merge

Well, I haven't even read this bit in Knuth. Let alone understood it. Let alone written one!

Straight Merge

The diagram at left shows a typical partitioning scheme for Straight Merge (sorting 13 elements). Imagine each box represents a portion of the array that is known to be sorted (initially each box contains only one element!). In each merging pass, we merge adjacent sorted areas. In a simple merging algorithm, where the lists are about the same size (as they are here), you compare elements one at a time from the two sorted portions.
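The passes described above can be sketched roughly as follows. This is a minimal Python illustration, not the ASP implementation discussed later; the name `straight_merge_sort` and the two-buffer layout are assumptions made for the example.

```python
def straight_merge_sort(items):
    """Bottom-up merge sort. Returns a new sorted list; equal keys keep their order."""
    src = list(items)
    dst = [None] * len(src)      # second buffer: the extra memory a merge sort needs
    width = 1                    # current length of each sorted area
    while width < len(src):
        for lo in range(0, len(src), 2 * width):
            mid = min(lo + width, len(src))
            hi = min(lo + 2 * width, len(src))
            i, j, k = lo, mid, lo
            while i < mid and j < hi:
                if src[j] < src[i]:          # strict "<" keeps equal keys in order
                    dst[k] = src[j]
                    j += 1
                else:
                    dst[k] = src[i]
                    i += 1
                k += 1
            # One area ran out: copy the rest of the other without comparing.
            dst[k:hi] = src[i:mid] + src[j:hi]
        src, dst = dst, src      # this pass's output is the next pass's input
        width *= 2
    return src


data = [8, 3, 11, 1, 13, 5, 2, 10, 7, 12, 4, 9, 6]   # 13 elements
print(straight_merge_sort(data))
```

The partition boundaries depend only on the number of elements, never on their values, which is why the running time is so predictable.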

In Knuth's "The Art of Computer Programming", volume 3, in the discussion of Quicksort, he mentions that, when a Quicksort run is taking longer than expected, it may be worth switching over to a Straight Mergesort (because when Quicksort goes bad it goes really bad).
Usually, a Straight Merge is slower than Quicksort, because it requires more memory to keep track of what is to go into each merged partition. But it has a big advantage over Quicksort - its worst case is hardly any worse than its best case. And its best case is none too shabby.

It is, however, easy to see roughly how long a Straight Merge is going to take, because the partition boundaries are fixed. There's some variability. When one of the two lists going into a partition runs out, it's quicker to copy the remaining elements (you don't have to do any more comparisons), so Straight Merge does slightly (but only slightly) better for data already mostly in ascending or descending order. (If you think your data will be mostly sorted, see the Natural Merge, which is really good at that, and the List Merge, which is even better. The Straight Insertion or the much-maligned bubble sort won't do too badly either.)

Here's a link to an example Mergesort include file (this is an ASP implementation), which sorts arrays of variant arrays, using their indices. It's a good example of how NOT to write sorting routines. My apologies. It tries to handle Nulls and case sensitive comparisons (which makes it quite a bit longer - without all that the CompareVariants routine could have been pretty much replaced with "<"). It also allows you to specify a multiple column sort key. Given that Straight Mergesort - as I've implemented it there - is a stable sort, that wasn't necessary. To sort by columns A and B, you can merely sort by B and then by A.

(Note that the textbook version of Straight Mergesort, as given by Knuth, is not stable, because it has the "left-over" bits in the centre and merges from both ends. The textbook version is a little faster than mine, but I wanted stability.)
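The sort-by-B-then-by-A trick depends only on stability, so it can be illustrated with any stable sort. The sketch below uses Python's built-in `sorted` (which is documented to be stable) as a stand-in for the ASP routine; the sample rows are invented for the example.

```python
# Each row is (column A, column B).
rows = [("smith", 3), ("jones", 1), ("smith", 1), ("jones", 2)]

# Sort by the minor key (B) first, then by the major key (A).
# Because the second sort is stable, rows with equal A stay in B order.
by_b = sorted(rows, key=lambda row: row[1])
by_a_then_b = sorted(by_b, key=lambda row: row[0])

print(by_a_then_b)
# [('jones', 1), ('jones', 2), ('smith', 1), ('smith', 3)]
```

With an unstable sort the second pass would be free to scramble the B order within each A group, which is why a multi-column key is needed in that case.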

The routines that really matter are

TwoWayMergeSubArrays(), which merges two sorted sub-arrays
ReorderTargetByRef(), which, given a target array and an array of indices indicating how it is to be reordered, shuffles its order. In later articles in this series I'll cover the major performance advantages you can get (in VB, if not in VBScript!)
SimpleMergesortArrayOfArrays(), which is an iterative implementation of a Straight Merge Sort. It uses two arrays to hold indices. At any given time, one holds the result of the last merging run (horizontal row in the pink table above), and the other holds the partially complete results of the current run. In VBScript, using two arrays means the routine is twice as long. In VB, it is possible to swap the values of variants using diabolical API functions, which speeds things up considerably, and requires less code.
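The idea of sorting indices and then reordering the target in one go can be sketched like this. It is only an illustration in Python: `sort_indices` and `reorder_by_ref` are hypothetical names, the exact behaviour of the ASP ReorderTargetByRef() is assumed rather than copied, and the built-in `sorted` stands in for the merge routine.

```python
def sort_indices(rows, key_column):
    """Return row numbers in sorted order; `rows` itself is not touched."""
    return sorted(range(len(rows)), key=lambda i: rows[i][key_column])


def reorder_by_ref(target, order):
    """Rearrange `target` in place so position k holds the old target[order[k]]."""
    target[:] = [target[i] for i in order]


rows = [["pear", 3], ["apple", 7], ["fig", 1]]
order = sort_indices(rows, 0)
reorder_by_ref(rows, order)
print(order)  # [1, 2, 0]
print(rows)   # [['apple', 7], ['fig', 1], ['pear', 3]]
```

Sorting small integers instead of whole rows means each row is moved only once, at the end.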

List Merge

List merge attempts to cut down on the amount of copying of elements done by a Straight Merge (which ends up moving most elements - or indices, if the sort is indexed - about Ceiling(log2 N)+1 times), by using linked allocation. Instead of moving records about, it merges linked lists (this has about the same effect as cutting down on the moves by half).
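A minimal sketch of the linked-allocation idea, in Python: the links live in a separate `nxt` array of indices, so the records themselves never move. The function name, the `END` marker and the pass-by-pass pairing of chains are all assumptions made for this example, not a transcription of any particular textbook version.

```python
def list_merge_sort(keys):
    """Return the indices of `keys` in ascending key order without moving any key."""
    n = len(keys)
    if n == 0:
        return []
    END = -1
    nxt = [END] * n          # nxt[i] = index of the record that follows i in its chain

    def merge(a, b):
        """Merge two sorted, non-empty chains headed by a and b; return the new head."""
        head = tail = END
        while a != END and b != END:
            if keys[b] < keys[a]:        # strict "<" so ties favour the earlier chain
                pick, b = b, nxt[b]
            else:
                pick, a = a, nxt[a]
            if tail == END:
                head = pick
            else:
                nxt[tail] = pick         # only a link changes, never a record
            tail = pick
        nxt[tail] = a if a != END else b # splice on whatever is left over
        return head

    heads = list(range(n))               # start with n one-element chains
    while len(heads) > 1:
        merged = [merge(heads[k], heads[k + 1]) for k in range(0, len(heads) - 1, 2)]
        if len(heads) % 2:
            merged.append(heads[-1])     # odd chain out waits for the next pass
        heads = merged

    order = []
    i = heads[0]
    while i != END:                      # walk the final chain to read off the order
        order.append(i)
        i = nxt[i]
    return order


print(list_merge_sort([5, 2, 9, 2, 7]))  # [1, 3, 0, 4, 2]
```

The result is an index order, so one final reordering pass is still needed if the records must end up physically sorted.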

On some computers, in some languages, a well-written List Merge will run faster than Quicksort. And it will have a much better worst case.

Exploitative List Merge

This is a beast of my own crafting that attempts to exploit long runs in the input (it's suited to situations where, for example, you had a sort order, and only a very few records have since been updated, but you don't know which ones).

There are two ways to write a List Merge. In one of them you don't try to exploit any existing order in the input array. That's the safe one. It's just like Straight Merge only faster.
If you try to exploit ordering already present in the input data, you might get bitten. List Merge and Quicksort can exploit different kinds of order in the array. Quicksort does well if the data are in "mostly" ascending order (but it's not important that values near each other be in order). An "exploitative" List Merge, on the other hand, can be quite good when there are a lot of long runs (elements going up or down monotonically) in the array it is sorting.
Where the Exploitative List Merge falls down, however, is that a linked list structure isn't much use for finding the spots where the elements in a short list belong in a much longer list. You have to look at all the possible places one at a time, starting from the head of the list.
To try to avoid situations where short lists are repeatedly merged into long ones, try to merge lists of about the same length. Repeated lopsided merges could, in theory, give you pretty rotten worst-case performance. Consider what happens when your run lengths are, say, 64,1,16,1,4,1,16,1,64,1: if you just merged adjacent lists it'd take on average 32.5+8.5+2.5+8.5+32.5=84.5 comparisons to get to 65,17,5,17,65, and then you've still got lots of very different sized lists to merge. You get 82,22,65, and only then 82,87. My approach has been to sort the lists by the log of their length and start merging the small ones first (though this means that the sort isn't stable).
Though if you "know" that the updates can't move records far from their previous sort order that isn't such an issue. The worst possible case for the example I gave above is 64+16+4+16+64 (phase one) +79+19 (phase two) +167 = 429 comparisons to merge 169 records, which still isn't too bad.
Merging small lists first would change that to about 350-odd.
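The "find the runs, then merge the small ones first" idea can be sketched as below. This is a simplified Python illustration under several assumptions: it works on ordinary lists rather than linked lists, it only detects ascending runs, it orders by exact length using a heap rather than by the log of the length, and it borrows the standard library's `heapq.merge` for the two-way merge. The function names are made up for the example.

```python
import heapq
from itertools import count


def split_into_runs(items):
    """Cut `items` into maximal non-decreasing runs."""
    runs, start = [], 0
    for i in range(1, len(items) + 1):
        if i == len(items) or items[i] < items[i - 1]:
            runs.append(items[start:i])
            start = i
    return runs


def merge_runs_smallest_first(items):
    """Sort by repeatedly merging the two shortest runs (not a stable sort)."""
    tie = count()   # tie-breaker so the heap never has to compare two lists
    heap = [(len(run), next(tie), run) for run in split_into_runs(items)]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        merged = list(heapq.merge(a, b))
        heapq.heappush(heap, (len(merged), next(tie), merged))
    return heap[0][2] if heap else []


data = [1, 2, 3, 9, 0, 4, 5, 6, 8, 7]
print(split_into_runs(data))            # [[1, 2, 3, 9], [0, 4, 5, 6, 8], [7]]
print(merge_runs_smallest_first(data))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because runs are merged in order of size rather than position, equal keys can change places, which is the loss of stability mentioned above.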

Binary Insertion Merging

Binary insertion merging tries to cut down the number of comparisons when merging lists of greatly different sizes. It can also cut down the number of comparisons when merging two lists of about the same length, if the data are "clumpy". For example, consider the lists A (1,23,24,25,26,27,28,31,38,45,47) and B (2,3,4,29,34,35,36,41,42,43,46,50). A straight or list merge will take 22 comparisons to get the right answer. But by doing binary searches from A into B (starting in the middle of A, not the end), you can get there with far fewer. You'd start by looking for 27 in B, and find where it belongs in 3 comparisons. And then you'd have to find the five lower values of A only in the left part (2,3,4) of B, and the five higher values in the right part. And so on.
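Here is a rough Python sketch of the divide-by-binary-search idea, applied to the case of greatly different list sizes. It is an illustration only: the function names, the `stats` dictionary used to count comparisons and the sample data are all assumptions, and the list slicing is wasteful compared with a real implementation.

```python
def binary_merge(a, b, stats):
    """Merge sorted lists a and b by binary-searching the middle of a into b."""
    if not a or not b:
        return list(a) + list(b)
    mid = len(a) // 2
    pivot = a[mid]
    lo, hi = 0, len(b)
    while lo < hi:                       # find the first b element >= pivot
        probe = (lo + hi) // 2
        stats["comparisons"] += 1
        if b[probe] < pivot:
            lo = probe + 1
        else:
            hi = probe
    # Everything in b[:lo] is below the pivot, everything in b[lo:] is not.
    return (binary_merge(a[:mid], b[:lo], stats)
            + [pivot]
            + binary_merge(a[mid + 1:], b[lo:], stats))


def straight_merge(a, b, stats):
    """Ordinary one-at-a-time merge, counting comparisons the same way."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        stats["comparisons"] += 1
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    return out + a[i:] + b[j:]


short = [100, 500, 900]
long_list = list(range(0, 1000, 2))      # 500 even numbers

binary_stats = {"comparisons": 0}
straight_stats = {"comparisons": 0}
assert binary_merge(short, long_list, binary_stats) == sorted(short + long_list)
assert straight_merge(short, long_list, straight_stats) == sorted(short + long_list)
print(binary_stats["comparisons"], "vs", straight_stats["comparisons"])
```

On lopsided input like this the binary version needs only a handful of comparisons per element of the short list, whereas the straight merge has to walk most of the long one.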

I've been experimenting with binary insertion merging (on 23-Feb-2001 I wrote my first sorting algorithm that uses it), but so far the results have been somewhat disappointing. The one routine I've written so far appears to perform about 4 times slower than Quicksort. Doubtlessly there's plenty of room for improvement. But my guess is I won't be able to get even close to Quicksort's performance (except for a few special cases, such as an input dataset that contains a large number of records that were previously sorted and a smaller - but still significant - number of records that have been edited in some way and are now in the wrong place).