Site hosted by Angelfire.com: Build your free website today!

Merge Sorts

Introduction

Natural Merge

Well, I haven't even read this bit in Knuth. Let alone understood it. Let alone written one!

Straight Merge

             
       
    
  
The diagram at left shows a typical partitioning scheme for Straight Merge (sorting 13 elements). Imagine each box represents a portion of the array that is known to be sorted (initially each box contains only one element!). In each merging pass, we merge adjacent sorted areas. In a simple merging algorithm, where the lists are about the same size (as they are here), you compare elements one at a time from the two sorted portions.
In Knuth's "The Art of Computer Programming", volume 2, in the discussion of Quicksort, he mentions that, when a Quicksort run is taking longer than expected, it may be worth switching over to a Straight Mergesort (because when Quicksort goes bad it goes really bad).
Usually, a Straight Merge is slower than Quicksort, because it requires more memory to keep track of what is to go into each merged partition. But it has a big advantage over Quicksort - its worse case is hardly any worse than its best case. And its best case is none too shabby.

It is, however, easy to see roughly how long a Straight Merge is going to take, because the partition boundaries are fixed. There's some variability. When one of the two lists going into a partition runs out, it's quicker to copy the remaining elements (you don't have to do any more comparisons), so Straight Merge does slightly (but only slightly) better for data already mostly in the ascending or descending order (If you think your data will be mostly sorted, see the Natural Merge, which is really good at that, and the List Merge, which is even better. The Straight Insertion or the much-maligned bubble sort won't do too badly either).

Here's a link to an example Mergesort include file (this is an ASP implementation), which sorts arrays of variant arrays, using their indices. It's a good example of how NOT to write sorting routines. My apologies. It tries to handle Nulls and case sensitive comparisons (which makes it quite a bit longer - without all that the CompareVariants routine could have been pretty much replaced with "<"). It also allows you to specify a multiple column sort key. Given that - as I've impemented it there - Straight Mergesort is a stable sort that wasn't necessary. To sort by columns A and B, you can merely sort by B and then by A.

(Note that the textbook version of Straight Mergesort, as given by Knuth, is not stable, because it has the "left-over" bits in the centre and merges from both ends. The textbook version is a little faster than mine, but I wanted stability.]

The routines that really matter are
  1. TwoWayMergeSubArrays(), which merges two sorted sub-arrays
  2. ReorderTargetByRef(), which given a target array and an array of indices indicating how it is to be reordered, shuffles its order. In later articles in this series I'll cover the major performance advantages you can get (in VB, if not in VBScript!)
  3. SimpleMergesortArrayOfArrays(), which is an iterative implementation of a Straight Merge Sort. It uses two arrays to hold indices. At any given time, one holds the result of the last merging run (horizontal row in the pink table above), and the other holds the partially complete results of the current run. In VBScript, using two arrays means the routine is twice as long. In VB, it is possible to swap the values of variants using diabolical API functions, which speeds things up considerably, and requires less code.

List Merge

List merge attempts to cut down on the amount of copying of elements done by a Straight Merge (which ends up moving most elements - or index, if the sort is indexed - about Ceiling(logN)+1 times), by using linked allocation. Instead of moving records about it merges linked lists (this has about the same effect as cutting down on the moves by half).

Exploitative List Merge

This is a beast of my own crafting that attempts to exploit long runs in the input (it's suited to situations where, for example, you had a sort order, and only a very few records have since been updated, but you don't know which ones).

Binary Insertion Merging

Binary insertion merging tries to cut down the number of comparisons when merging lists of greatly different sizes. It can also cut down the number of comparisons when merging two list of about the same length, if the data are "clumpy". For example, consider the lists A (1,23,24,25,26,27,28,31,38,45,47) and B (2,3,4,29,34,35,36,41,42,43,46,50). A straight or list merge will take 21 comparisons to get the right answer. But by doing binary searches from A into B (starting in the middle of A, not the end), you can get there with far fewer. You'd start by looking for 27 in B, and find where it belongs in 3 comparisons. And then you'd have to find the five lower values of A only between the left part (2,3,4,27) of B, and the five higher values in the right part. And so on.

I've been experimenting with binary insertion merging (on 23-Feb-2001 I wrote my first sorting algorithm that uses it), but so far the results have been somwhat disappointing. The one routine I've written so far appears to perform about 4 times slower than Quicksort. Doubtlessly there's plenty of room for improvement. But my guess is I won't be able to get even close to Quicksort's performance (except for a few special cases, such as an input dataset that contains a large number of records that were previously sorted and a smaller - but still significant - number of records that have been edited in some way and are now in the wrong place).