LOGO

A General Sort Package

Title:
A General Sort Package
Language:
C
Author:
Philip J. Erdelsky
Date:
March 10, 2001
Usage:
Public domain; no restrictions on use
Portability:
Any C compiler
Keywords:
Sorting
Abstract:
A fairly general package for sorting arbitrary items, all of the same size, using an in-memory merge sort and a file merge sort when memory is insufficient.
Source:
zsort.txt

Table of Contents

  1. Introduction
  2. Sorting
  3. Finding Duplicates
  4. Counting Distinct Items
  5. Counting Item Frequencies
  6. Stable Sorts
  7. Consistent Ordering
  8. Modifications

1. Introduction

I have written many sort packages. I am calling this one "ZSORT" in the hope that it will be the last sort package I will ever have to write. ;-)

  1. It handles items of a completely arbitrary nature, except that they must all be the same size.
  2. It uses the merge sort algorithm, which is non-recursive and of order NlogN in all cases.
  3. It can eliminate duplicate items during the sort, if this option is specified.
  4. It can be aborted if a user-specified condition is detected while two items are being compared.
  5. It is fairly efficient in its use of resources. Small sorts can be done in memory.
  6. It uses temporary files to handle large sorts.

The package uses a counter of type "unsigned long" to count items in temporary files, and it will fail in undefined ways if the counter overflows.

2. Sorting

The package consists of the header file ZSORT.H, which must be included in every source code module that uses the package, and the source code module ZSORT.C, which must be compiled and linked into the resulting application. It is written in C, but it can be called from either C or C++. The source file ZSORT.TXT contains both files; you will have to separate them. The file ZSORT.C also contains a test program that you will have to remove before using the package for other purposes.

The sorting process begins with the following function call, which allocates some of the required memory and returns a handle to be passed to other functions in the package:

     h = ZsortOpen(size, compare, memory, pointer);

     unsigned size;     size of items to be sorted, in bytes (all items
                        must be the same size)

     ZSORTCOMPARE compare;  comparison function, as described below

     unsigned memory;   maximum number of items to be sorted in memory

     void *pointer;     user-defined pointer to be passed to comparison
                        function

     ZSORT h;           handle to be passed to other functions, or NULL
                        if the operation fails for lack of memory

The comparison function is defined as follows:

     n = compare(p, q, pointer);

     void *p;           pointer to first item

     void *q;           pointer to second item

     void *pointer;     pointer passed to ZsortOpen()

     int n;             ZCOMPABORT if sort is to be aborted

                        ZCOMPDUPP if *p and *q are duplicates, *p is to
                        be retained and *q is to be removed,

                        ZCOMPDUPQ if *p and *q are duplicates, *q is to
                        be retained and *p is to be removed,

                        a negative value other than the foregoing if *p
                        is to precede *q in sorted order,

                        a positive value other than the foregoing if *p
                        is to follow *q is sorted order,

                        zero if the relative position of *p and *q in
                        sorted order is unspecified

If the comparison function returns ZCOMPDUPP or ZCOMPDUPQ for any pair of items, it must NOT return 0 for any other pair of items (and vice-versa).

The comparison function must define a consistent ordering of all the items. Details in Section 7 below.

The comparison function may make changes in parts of the items not involved in comparison. These changes will appear in the items when they are retrieved.

Then items are submitted, one at a time, to the following function:

     result = ZsortSubmit(h, p);

     ZSORT h;           handle returned by successful call on
                        ZsortOpen()

     const void *p;     pointer to item to be submitted; the item is
                        copied to an internal buffer, so it need not
                        remain valid after this function returns

     int result;        0            successful operation
                        ZSORTABORT   operation aborted
                        ZSORTMEMFAIL memory allocation failure
                        ZSORTFILEERR temporary file error

The package sorts items in memory as much as possible. When the maximum specified in the call on ZsortOpen is exceeded, it writes the excess items into temporary files. If "memory" argument to ZsortOpen is very large, submitting a single item may cause a considerable delay while a block is being sorted before being written to a temporary file.

When all items have been submitted, the following function is called repeatedly to retrieve the items in sorted order:

     result = ZsortRetrieve(h, p);

     ZSORT h;           handle returned by successful call on
                        ZsortOpen()

     void *p;           pointer to buffer to receive retrieved item;
                        must be allocated by caller

     int result;        0            successful operation
                        ZSORTEND     no more items
                        ZSORTABORT   operation aborted
                        ZSORTMEMFAIL memory allocation failure
                        ZSORTFILEERR temporary file error

The last call on ZsortRetrieve() returns ZSORTEND as its functional value and does not return any item.

The first call on ZsortRetrieve() usually does most of the actual sorting, so it may not return promptly.

It is not necessary to retrieve all sorted items. If ZsortSubmit() is called again before all items are retrieved, then all remaining items are lost and a new sort begins.

If ZsortRetrieve() is called again after returning ZSORTEND, it continues to return ZSORTEND.

For additional sorts on other blocks of items with the same size, comparison function and other parameters, just repeat the calls on ZsortSubmit() and ZsortRetrieve(). The memory blocks and temporary files created for the first sort will be re-used before others are allocated.

Finally, to release all memory blocks and close the temporary files, call the following function (failure to do so may cause resource leaks):

     ZsortClose(h, erase);

     ZSORT h;           handle returned by successful call on
                        ZsortOpen()

     int erase;         nonzero (true) if memory blocks and temporary
                        files are to be erased before they
                        are freed or deleted

If ZsortSubmit() or ZsortRetrieve() returns an error (ZSORTABORT, ZSORTMEMFAIL or ZSORTFILEERR), the sort must be abandoned and ZsortClose() must be called. Failure to do so may cause resource leaks.

The package calls the following standard C library functions:

     fclose()  fread()   malloc()  tmpfile()
     fgetc()   free()    memcpy()
     fputc()   fwrite()  rewind()

The package is thread-safe to a limited extent. Different threads can perform sorts with different handles without interference, provided the library functions that it calls are thread-safe.

3. Finding Duplicates

To determine whether a set of items contains duplicates, just write the comparison function so it aborts the sort when it is asked to compare duplicates, does NOT return 0 for non-duplicates, and implements a consistent rodering in other respects. (See Section 7 below.)

This is, in general, faster than sorting the set and then examining the result to see if any two consecutive items are duplicates. But does it always work?

The comparison function is called only about 0.5*N*logN times, although there are N*(N-1) pairs of items. How do we know it will be asked to compare the duplicates?

Assume, for purpose of contradiction, that there is a subset of two or more duplicate items, but the comparison function is not called for any pair in the subset, but only for pairs contining at least one item outside the subset. Then imagine two slightly different sets in which the items in the subset are not quite identical, but still stand in the same relationships to items outside the subset. For example:

     Old set:  9, 1, 2, 3, 4,   4,   4,   5, 6, 7

     New set:  9, 1, 2, 3, 4.1, 4.2, 4.3, 5, 6, 7

     New set:  9, 1, 2, 3, 4.2, 4.1, 4.3, 5, 6, 7

Now sort the two new sets with the same algorithm. Since the items in the subset are never compared with each other, The sort handles them identically. But this would put at least one of the new sets into the wrong order. This is the contradiction we sought.

4. Counting Distinct Items

To determine the number of distinct items in a set, just write the comparison function to discard duplicates. Then submit all items, and count the number of times ZsortRetrieve() returns 0.

5. Counting Item Frequencies

Suppose we need a frequency list of a set of items; i.e., a list of the distinct items, and the number of times each one appears in the set. We could simply sort the set and then run through the results. However, there is a more efficient way, especially if the number of distinct items is much less than the number of items.

Combine each item with a count, which is initialized to 1 when the item is submitted. When the comparison function finds duplicate items, let it add the count in the item to be discarded to the count in the item to be retained. The resulting sorted set is the desired frequency list.

6. Stable Sorts

A stable sort is one that retains the relative position of items that the comparison function does not rank.

As it stands, ZSORT is not stable. If a stable sort is desired, an increasing serial number should be appended to each item before it is submitted to ZsortSubmit(). Then the comparison function can rank otherwise equal items by their serial numbers.

7. Consistent Ordering

The comparison function must define a consistent ordering of items. All of the most commonly used orderings (alphabetic, numerical, chronological, etc.) are consistent. However, in special cases it may be necessary to examine all the relevant rules.

As noted above, if the comparison function returns ZCOMPDUPP or ZCOMPDUPQ for any pair of items, it must NOT return 0 for any other pair of items (and vice-versa).

Two items which are identical or for which the comparison function returns 0, ZCOMPDUPP or ZCOMPDUPQ are called equivalent. This relationship must be a mathematically sound equivalence relation; i.e., it must be reflexive, symmetric and transitive:

  1. (Reflexive) Any element is equivalent to itself.
  2. (Symmetric) If A is equivalent to B, then B is equivalent to A.
  3. (Transitive) If A is equivalent to B and B is equivalent to C, then A is equivalent to C.

In returning other values, the comparison function must obey these conditions:

  1. If compare(p,q,pointer)>0, then compare(q,p,pointer)<0, and vice-versa.
  2. (Transitive) If compare(p,q,pointer)>0 and compare(q,r,pointer)>0, then compare(p,r,pointer)>0.
  3. If compare(p,q,pointer)>0, *p is equivalent to *pp, and *q is equivalent to *qq, then compare(pp,qq,pointer)>0.

8. Modifications

If the definitions of ZCOMPDUPP and ZCOMPDUPQ are removed from ZSORT.H, then the duplication-removal feature of the package will be disabled, and the code that implements it will be removed.

Similarly, if the definition of ZCOMPABORT is removed from ZSORT.H, then the abort feature of the package will be disabled, and the code that implements it will be removed.

Care should be taken that the special values ZCOMPABORT, ZCOMPDUPP and ZCOMPDUPQ are not returned as a result of a regular comparison. In the file ZSORT.H, these are set to extreme values in the range of the type "int" for 32-bit machines. If you are compiling the package for a 16-bit machine, you will have to change these values.

The standard comparison functions strcmp(), stricmp(), memcmp() and memicmp() usually return only values in some rather limited range, generally between -256 and 256, inclusive. However, this is not absolutely guaranteed. For absolute portability, you may have to avoid using the library versions of these functions and use substitutes such as the following:

     int strcmp(const char *p, const char *q)
     {
       int n;
       while (1)
       {
         n = * (unsigned char *) p - * (unsigned char *) q;
         if (n != 0 || *p == 0)
           break;
         p++;
         q++;
       }
       return n;
     }

     int stricmp(const char *p, const char *q)
     {
       int n;
       while (1)
       {
         n = toupper(* (unsigned char *) p) -
           toupper(* (unsigned char *) q);
         if (n != 0 || *p == 0)
           break;
         p++;
         q++;
       }
       return n;
     }

     int memcmp(const void *p, const void *q, unsigned size)
     {
       int n = 0;
       unsigned i;
       for (i = 0; i < size; i++)
       {
         n = ((unsigned char *) p)[i] - ((unsigned char *) q)[i];
         if (n != 0)
           break;
       }
       return n;
     }

     int memicmp(const void *p, const void *q, unsigned size)
     {
       int n = 0;
       unsigned i;
       for (i = 0; i < size; i++)
       {
         n = toupper(((unsigned char *) p)[i]) -
           toupper(((unsigned char *) q)[i]);
         if (n != 0)
           break;
       }
       return n;
     }

Alternatively, you can use the regular library function and put a test after it, like this:

     n = strcmp(p, q);
     if (n < 0)
       n = -1;
This works because the values of ZCOMPABORT, ZCOMPDUPP and ZCOMPDUPQ are all negative.