LOGO

Merge Sort

Title:
Merge Sort
Language:
C
Author:
Philip J. Erdelsky
Date:
July 31, 1998
Usage:
Public domain; no restrictions on use
Portability:
Any C compiler
Keywords:
Sort
Abstract:
A C function to sort a file, using arbitrary read, write and comparison functions and a tape-merge algorithm, which is O(n log n) in all cases. The function is non-recursive and re-entrant.

The function merge_sort() will sort virtually any kind of file, using read, write and comparison functions supplied by the calling program.

  1. It sorts records of variable size.
  2. It requires O(n log n) comparisons, where n is the number of records, regardless of the initial order. The time to sort a file is predictable and consistent. This performance is known to be optimal for general sorts.
  3. The number of times a record must be read from, or written to, a disk file is also O(n log n).
  4. The algorithm is iterative, not recursive, so there is no problem of stack overflow. The stack requirement is predictable, consistent and small.
  5. It requires approximately enough disk space to hold two copies of the file. The sorted file is left in this space.
  6. It requires at least enough memory to hold three records, but it will run faster if it is provided with more memory.
  7. It calls tmpfile() to create as many as four temporary files at a time.
  8. The unsorted file is read sequentially, and the sorted file is written sequentially. Hence they may be tapes, I/O streams or other strictly sequential devices.

The function call is as follows:

     result = merge_sort(unsorted_file, sorted_file, read, write, compare,
       pointer, max_record_size, block_size, pcount);

The calling program must supply the following parameters:

FILE *unsorted_file;
A file pointer for the file to be sorted. The file must have been successfully opened for reading, and the file pointer must be positioned at the beginning of the file.
FILE *sorted_file;
A file pointer for the sorted file. The file must have been successfully opened for writing, and the file pointer must be positioned at the beginning of the file.
int (*read)();
A function that reads a single record.
int (*write)();
A function that writes a single record.
int (*compare)();
A function that compares two records.
void *pointer;
A user-defined pointer to be passed to the (*read)(), (*write)() and (*compare)() functions when they are called.
unsigned max_record_size;
The maximum number of bytes in a record, when it is read into memory. This need not be the same as the size of the record when it resides in a file.
unsigned long block_size;
The number of records to be sorted in memory, or 1L if memory sorting is not to be used.
unsigned long *pcount;
A pointer to a variable to receive the record count (if the sort is successful), or NULL if this information is not to be returned.

The function returns the following values:

int result;
Result code:
  • 0 for a successful sort
  • 1 for insufficient memory
  • 2 for a file creation error
  • 3 for a file write error

The unsorted and sorted files will not be rewound or closed. Each file will be left open, with the file pointer positioned at the end of the file. However, if the two file pointers are identical, the file will be rewound and the sorted records will be written over it.

The (*read)() function is called by merge_sort() as follows, and must read one record each time it is called (except when the end of the unsorted file is detected):

     n = (*read)(fp, buffer, pointer);
FILE *fp;
A file pointer for the unsorted file or a temporary file created by calling tmpfile().
void *buffer;
A pointer to a buffer to receive one record. This buffer can hold at most max_record_size bytes.
void *pointer;
A copy of the pointer passed as an argument to merge_sort().
int n;
The number of bytes in the record, or zero if an attempt was made to read past the end of the unsorted file.

When this function returns zero, it will not be called again for the same file. No attempt will be made to read past the end of a temporary file.

A record larger than the specified maximum size must be truncated or dealt with in some other suitable manner. The function will fail catastrophically if the buffer is allowed to overflow, or if the value returned by (*read)() is larger than the maximum record size.

The record format may be changed when it is read into memory, provided that:

  1. the changes are compatible with the (*write)() and (*compare)() functions, and
  2. the value returned by the (*read)() function must be the number of bytes in the record while it is in the memory buffer.

For example, the line terminator \r\n in DOS/Windows might be converted to \n when it is read into memory.

The (*write)() function is called as follows, and must write one record each time it is called:

     n = (*write)(fp, buffer, pointer);
FILE *fp;
A file pointer for the sorted file or a temporary file created by calling tmpfile().
void *buffer;
A pointer to a buffer holding one record, as read into memory by (*read)().
void *pointer;
A copy of the pointer passed as an argument to merge_sort().
int n;
Nonzero for a successful write; zero for insufficient disk space.

Notice that the record length is not passed as a parameter. It must be contained, explicitly or implicitly, in the record itself or in some other place accessible to the (*write)() function.

The (*compare)() function is called to compare two records, as follows:

     n = (*compare)(p, q, pointer);
void *p;
A pointer to a buffer containing the first record, as read into memory by (*read)().
void *q;
A pointer to a buffer containing the second record, as read into memory by (*read)().
void *pointer;
A copy of the pointer passed as an argument to merge_sort().
int n;
The comparison result, as follows:
  • >0 if *p is to be after *q in sorted order
  • <0 if *p is to be before *q in sorted order
  • 0 if the order of *p and *q is irrelevant

The sort proceeds as follows:

  1. The records are read from the unsorted file and written to two temporary files in blocks. Each block contains the number of records specified by the block_size argument (except the last block, which may contain fewer records), and it is sorted in memory by calling linked_list_sort() before it is written to a temporary file.
  2. The temporary files are then sorted by merging blocks in a number of passes. In each pass, blocks from two source files are merged into blocks containing twice as many records, and are written to two output files. The process terminates when the block size equals or exceeds the file size and all records are in a single file, which is the sorted file.

The sort can fail if

  1. a call on malloc() returns NULL because there is not enough memory, or
  2. a call on tmpfile() returns NULL because there are too many open files, or
  3. a call on (*write)() returns zero because there is insufficient disk space.

Whether the sort is successful or not, all allocated memory blocks will be deallocated, and all temporary files will be closed and deleted. If the sort fails, the file pointers for the unsorted and sorted files will be left wherever they happened to be.

The merge_sort() function is reentrant if the functions that it calls are reentrant:

Under DOS/Windows, tmpfile() creates a BINARY file. However, this usually creates no problems, even if the unsorted and sorted files are text files, because DOS/Windows handles the conversions in a consistent manner.

If disk space is extremely tight, the sort can be modified to use the space occupied by the unsorted file. The unsorted file, temporary files, and sorted file can then fit into a space approximately twice as large as the unsorted file. To make this modification, just insert code to delete the unsorted file after it has been read. Notice that if this technique is used, the sorted and unsorted files must be different, and the argument passing structure may become a little less elegant because the function call to delete a file usually requires file specifications, not merely a file pointer.

The merge_sort() function is not stable; i.e., it does not preserve the relative positions of two records for which the (*compare)() function returns zero. The function stable_merge_sort() does have this property. However, the additional feature carries a price: a four-byte record number is added to each record while it is in memory or in a temporary file. This increases the disk and memory requirements, by a substantial amount if the records are small.

The SORT utility illustrates the use of stable_merge_sort(). It is a DOS or UNIX filter which sorts the lines of a text file. The maximum line length is MAX_LINE characters, not including the carriage return and line feed at the end of each line. The utility will abort the sort and display an error message if it reads a line containing more than MAX_LINE characters. The value of MAX_LINE is set at 255, but it can be changed by recompiling the utility.

The utility is called as follows:

     SORT  <unsorted_file  >sorted_file  M  N

The sort includes columns number M to N, inclusive, where the leftmost column is number zero. If N is omitted, the sort includes column M and all columns to the right of it. If both M and N are omitted, all columns are used in the sort.

A line containing fewer than N+1 columns is sorted as though it were padded with nulls at the right end.

Characters are ranked by their ASCII codes, considered as UNSIGNED bytes.

This package calls on another package:

Source code in text format:

An Indonesian translation of this page has been posted at www.chameleonjohn.com/translations/mergesor-Indonesian as part of ChameleonJohn.

A Romanian translation of this page has been posted at http://www.dontpayfull.com/page/merge-sort by Irina Vasilescu.