[TECH] Algorithmic details of UNIX Sort command.

I happened to look at the algorithmic details of UNIX Sort, a LINUX version of
the classic UNIX sort is a part of GNU coreutils-6.9.90. This is classic
example of the standard External R-Way merge , to sort a data of size N bytes with a main memory size of M so it creates N/M runs and merges R at a time, the number of passes through the data is log(N/M)/log(R)
passes.In fact the lower bound(runtime) for external sorting is Ω((N/M)log(N/M)/log(R)). All the external memory sorting algorithms provided in the literature are optimal so the fight here is minimizing the constant before the number of passes.

UNIX sort treats keys are lines (strings), the algorithm followed by
unix sort is in fact the R-Way merge. Let the input file size be
IN_SIZE.

1. Choosing Run Size:
——————————–
The sizes of the initial runs are chosen from the total physical
memory (TOTAL_PHY) and available memory (AVAIL_PHY).
RUN_SIZE = (MAX(TOTAL_PHY/8,AVAIL_PHY))/2

maximum of 1/8th of TOTAL_PHY and AVAIL_PHY and divided by 2.
See function “default_sort_size (void)” in the code.

2. Creating Runs:
————————-

Unix sort creates a temporary file for every run. So it creates
IN_SIZE/RUN_SIZE (celing) temporary files. Internally it uses merge
sort to sort internally it uses an optimization mentioned in Knuth
volume 3 (2nd edition), problem 5.2.4-23.

3. Merging:
—————-

The number of runs merged at any time is hard coded in the program
see macro NMERGE , NMERGE is defined to be 16 so it merges
exactly 16 runs at any time.

~ by Vamsi Kundeti on March 10, 2008.

Leave a Reply