A General Sort Package |
I have written many sort packages. I am calling this one "ZSORT" in the hope that it will be the last sort package I will ever have to write. ;-)
The package uses a counter of type "unsigned long" to count items in
temporary files, and it will fail in undefined ways if the counter
overflows.
2. Sorting
The package consists of the header file ZSORT.H, which must be included in every source code module that uses the package, and the source code module ZSORT.C, which must be compiled and linked into the resulting application. It is written in C, but it can be called from either C or C++. The source file ZSORT.TXT contains both files; you will have to separate them. The file ZSORT.C also contains a test program that you will have to remove before using the package for other purposes.
The sorting process begins with the following function call, which allocates some of the required memory and returns a handle to be passed to other functions in the package:
h = ZsortOpen(size, compare, memory, pointer); unsigned size; size of items to be sorted, in bytes (all items must be the same size) ZSORTCOMPARE compare; comparison function, as described below unsigned memory; maximum number of items to be sorted in memory void *pointer; user-defined pointer to be passed to comparison function ZSORT h; handle to be passed to other functions, or NULL if the operation fails for lack of memory
The comparison function is defined as follows:
n = compare(p, q, pointer); void *p; pointer to first item void *q; pointer to second item void *pointer; pointer passed to ZsortOpen() int n; ZCOMPABORT if sort is to be aborted ZCOMPDUPP if *p and *q are duplicates, *p is to be retained and *q is to be removed, ZCOMPDUPQ if *p and *q are duplicates, *q is to be retained and *p is to be removed, a negative value other than the foregoing if *p is to precede *q in sorted order, a positive value other than the foregoing if *p is to follow *q is sorted order, zero if the relative position of *p and *q in sorted order is unspecified
If the comparison function returns ZCOMPDUPP or ZCOMPDUPQ for any pair of items, it must NOT return 0 for any other pair of items (and vice-versa).
The comparison function must define a consistent ordering of all the items. Details in Section 7 below.
The comparison function may make changes in parts of the items not involved in comparison. These changes will appear in the items when they are retrieved.
Then items are submitted, one at a time, to the following function:
result = ZsortSubmit(h, p); ZSORT h; handle returned by successful call on ZsortOpen() const void *p; pointer to item to be submitted; the item is copied to an internal buffer, so it need not remain valid after this function returns int result; 0 successful operation ZSORTABORT operation aborted ZSORTMEMFAIL memory allocation failure ZSORTFILEERR temporary file error
The package sorts items in memory as much as possible. When the maximum specified in the call on ZsortOpen is exceeded, it writes the excess items into temporary files. If "memory" argument to ZsortOpen is very large, submitting a single item may cause a considerable delay while a block is being sorted before being written to a temporary file.
When all items have been submitted, the following function is called repeatedly to retrieve the items in sorted order:
result = ZsortRetrieve(h, p); ZSORT h; handle returned by successful call on ZsortOpen() void *p; pointer to buffer to receive retrieved item; must be allocated by caller int result; 0 successful operation ZSORTEND no more items ZSORTABORT operation aborted ZSORTMEMFAIL memory allocation failure ZSORTFILEERR temporary file error
The last call on ZsortRetrieve() returns ZSORTEND as its functional value and does not return any item.
The first call on ZsortRetrieve() usually does most of the actual sorting, so it may not return promptly.
It is not necessary to retrieve all sorted items. If ZsortSubmit() is called again before all items are retrieved, then all remaining items are lost and a new sort begins.
If ZsortRetrieve() is called again after returning ZSORTEND, it continues to return ZSORTEND.
For additional sorts on other blocks of items with the same size, comparison function and other parameters, just repeat the calls on ZsortSubmit() and ZsortRetrieve(). The memory blocks and temporary files created for the first sort will be re-used before others are allocated.
Finally, to release all memory blocks and close the temporary files, call the following function (failure to do so may cause resource leaks):
ZsortClose(h, erase); ZSORT h; handle returned by successful call on ZsortOpen() int erase; nonzero (true) if memory blocks and temporary files are to be erased before they are freed or deleted
If ZsortSubmit() or ZsortRetrieve() returns an error (ZSORTABORT, ZSORTMEMFAIL or ZSORTFILEERR), the sort must be abandoned and ZsortClose() must be called. Failure to do so may cause resource leaks.
The package calls the following standard C library functions:
fclose() fread() malloc() tmpfile() fgetc() free() memcpy() fputc() fwrite() rewind()
The package is thread-safe to a limited extent. Different threads can
perform sorts with different handles without interference, provided the
library functions that it calls are thread-safe.
3. Finding Duplicates
To determine whether a set of items contains duplicates, just write the comparison function so it aborts the sort when it is asked to compare duplicates, does NOT return 0 for non-duplicates, and implements a consistent rodering in other respects. (See Section 7 below.)
This is, in general, faster than sorting the set and then examining the result to see if any two consecutive items are duplicates. But does it always work?
The comparison function is called only about 0.5*N*logN times, although there are N*(N-1) pairs of items. How do we know it will be asked to compare the duplicates?
Assume, for purpose of contradiction, that there is a subset of two or more duplicate items, but the comparison function is not called for any pair in the subset, but only for pairs contining at least one item outside the subset. Then imagine two slightly different sets in which the items in the subset are not quite identical, but still stand in the same relationships to items outside the subset. For example:
Old set: 9, 1, 2, 3, 4, 4, 4, 5, 6, 7 New set: 9, 1, 2, 3, 4.1, 4.2, 4.3, 5, 6, 7 New set: 9, 1, 2, 3, 4.2, 4.1, 4.3, 5, 6, 7
Now sort the two new sets with the same algorithm. Since the items in
the subset are never compared with each other, The sort handles them
identically. But this would put at least one of the new sets into the
wrong order. This is the contradiction we sought.
4. Counting Distinct Items
To determine the number of distinct items in a set, just write the
comparison function to discard duplicates. Then submit all items, and
count the number of times ZsortRetrieve() returns 0.
5. Counting Item Frequencies
Suppose we need a frequency list of a set of items; i.e., a list of the distinct items, and the number of times each one appears in the set. We could simply sort the set and then run through the results. However, there is a more efficient way, especially if the number of distinct items is much less than the number of items.
Combine each item with a count, which is initialized to 1 when the item
is submitted. When the comparison function finds duplicate items, let it
add the count in the item to be discarded to the count in the item to be
retained. The resulting sorted set is the desired frequency list.
6. Stable Sorts
A stable sort is one that retains the relative position of items that the comparison function does not rank.
As it stands, ZSORT is not stable. If a stable sort is desired, an
increasing serial number should be appended to each item before it is
submitted to ZsortSubmit(). Then the comparison function can rank
otherwise equal items by their serial numbers.
7. Consistent Ordering
The comparison function must define a consistent ordering of items. All of the most commonly used orderings (alphabetic, numerical, chronological, etc.) are consistent. However, in special cases it may be necessary to examine all the relevant rules.
As noted above, if the comparison function returns ZCOMPDUPP or ZCOMPDUPQ for any pair of items, it must NOT return 0 for any other pair of items (and vice-versa).
Two items which are identical or for which the comparison function returns 0, ZCOMPDUPP or ZCOMPDUPQ are called equivalent. This relationship must be a mathematically sound equivalence relation; i.e., it must be reflexive, symmetric and transitive:
In returning other values, the comparison function must obey these conditions:
If the definitions of ZCOMPDUPP and ZCOMPDUPQ are removed from ZSORT.H, then the duplication-removal feature of the package will be disabled, and the code that implements it will be removed.
Similarly, if the definition of ZCOMPABORT is removed from ZSORT.H, then the abort feature of the package will be disabled, and the code that implements it will be removed.
Care should be taken that the special values ZCOMPABORT, ZCOMPDUPP and ZCOMPDUPQ are not returned as a result of a regular comparison. In the file ZSORT.H, these are set to extreme values in the range of the type "int" for 32-bit machines. If you are compiling the package for a 16-bit machine, you will have to change these values.
The standard comparison functions strcmp(), stricmp(), memcmp() and memicmp() usually return only values in some rather limited range, generally between -256 and 256, inclusive. However, this is not absolutely guaranteed. For absolute portability, you may have to avoid using the library versions of these functions and use substitutes such as the following:
int strcmp(const char *p, const char *q) { int n; while (1) { n = * (unsigned char *) p - * (unsigned char *) q; if (n != 0 || *p == 0) break; p++; q++; } return n; } int stricmp(const char *p, const char *q) { int n; while (1) { n = toupper(* (unsigned char *) p) - toupper(* (unsigned char *) q); if (n != 0 || *p == 0) break; p++; q++; } return n; } int memcmp(const void *p, const void *q, unsigned size) { int n = 0; unsigned i; for (i = 0; i < size; i++) { n = ((unsigned char *) p)[i] - ((unsigned char *) q)[i]; if (n != 0) break; } return n; } int memicmp(const void *p, const void *q, unsigned size) { int n = 0; unsigned i; for (i = 0; i < size; i++) { n = toupper(((unsigned char *) p)[i]) - toupper(((unsigned char *) q)[i]); if (n != 0) break; } return n; }
Alternatively, you can use the regular library function and put a test after it, like this:
n = strcmp(p, q); if (n < 0) n = -1;This works because the values of ZCOMPABORT, ZCOMPDUPP and ZCOMPDUPQ are all negative.
A Ukranian translation of this page by Irma Alekseeva is available at smartranslator.net/programming/zsort.