External Sorting

We occasionally need to sort collections of data so large that they do not fit in memory. The algorithms we've explored previously, such as heapsort, do not work efficiently for external sorting. While the ith element of an array can be accessed very quickly, accessing the ith object stored in a file on disk is horrendously slow. As mentioned previously, any action involving the moving physical parts of a disk drive can take a million times as long as accessing memory. Worse, if the objects in the file are of different sizes, or if the data are stored on a sequential access device like a magnetic tape drive, finding the ith element can take time linear in the size of the file.

The process of sorting data too big to fit in memory is called external sorting. This section presents an external sorting algorithm based on merge sort (Section 9.3). We begin by dividing the data into many short runs. Each run is small enough to fit into memory, so we can sort the individual runs. The runs are then repeatedly merged into longer and longer runs, until there is only one run left.

Let c be the number of data that can fit in memory. We initially make runs of length c (Figure 17-29). (The last run, containing any "spare change," may be shorter than this.) These are then merged into runs of length 2c, then 4c, and so on, until all of the data are in the same run.

Figure 17-29. External sorting. The original data file is split into two files, each consisting of sorted runs. Bold, italic type is used to emphasize where one run ends and another begins. In the initial split, runs are short enough to fit in memory; in this example, the initial runs are of length 3 or less. These are merged into runs of length 6 or less, then 12 or less, and finally 24 or less. This final run is long enough to contain all of the data, which are now sorted.

Original Data

Split

 

First Merge

Second Merge

Result

t

g

c

c

e

c

a

a

g

m

j

g

f

e

b

b

m

t

q

j

i

f

d

c

j

e

f

m

l

g

h

d

q

l

i

q

n

i

k

e

c

n

s

t

s

j

o

f

n

a

b

a

h

l

p

g

e

d

k

b

r

m

r

h

l

o

p

d

 

n

 

i

i

h

 

k

 

q

 

j

f

r

 

o

 

s

 

k

s

   

p

 

t

 

l

a

           

m

d

           

n

o

           

o

p

           

p

k

           

q

b

           

r

r

           

s

h

           

t

The code we now present sorts the lines of a text file. There are several places in the algorithm where we have to determine whether there is another line left in a run. A run limited to length 4c, for example, might run out either because we've taken 4c lines from it or because we've hit the end of the file. To simplify the main code, these details are encapsulated in the ExternalSortRun class (Figure 17-30).

Figure 17-30. The ExternalSortRun class manages the details of each run.

1 import java.util.Scanner; 2 3 /** A run of lines used by the ExternalSort program. */ 4 public class ExternalSortRun { 5 6 /** Number of lines left in this run, possibly an overestimate. */ 7 private int count; 8 9 /** The next available line, if any. */ 10 private String next; 11 12 /** The Scanner from which the lines are drawn. */ 13 private Scanner scanner; 14 15 /** Up to maxLength lines will be drawn from scanner. */ 16 public ExternalSortRun(Scanner scanner, int maxLength) { 17 count = maxLength; 18 this.scanner = scanner; 19 if (scanner.hasNext()) { 20 next = scanner.nextLine(); 21 } else { 22 count = 0; 23 } 24 } 25 26 /** Return true if there is another line in this run. */ 27 public boolean hasNext() { 28 return count > 0; 29 } 30 31 /** Return the next available line and advance the run. */ 32 public String next() { 33 String result = next; 34 count--; 35 if (count > 0) { 36 if (scanner.hasNext()) { 37 next = scanner.nextLine(); 38 } else { 39 count = 0; 40 } 41 } 42 return result; 43 } 44 45 /** Return the next available line but do not advance the run. */ 46 public String peek() { 47 return next; 48 } 49 50 }

We describe the ExternalSort class (Figure 17-31) in a top-down fashion. The main() method invokes sort() on Files corresponding to args[0] and args[1]. Thus, to sort the file poetry.txt and put the output in sorted.txt, we would invoke our program as:

java ExternalSort poetry.txt sorted.txt.

 

Figure 17-31. Top-level methods in the ExternalSort class.

1 import java.io.*; 2 import java.util.Scanner; 3 4 /** Externally sort the lines of a text file. */ 5 public class ExternalSort { 6 7 /** Maximum number of lines stored in memory at any one time. */ 8 public static final int CAPACITY = 3; 9 10 /** Sort the file in and write the output to out. */ 11 public static void sort(File in, File out) throws IOException { 12 File[][] files = {{new File(in.getPath() + ".a0"), 13 new File(in.getPath() + ".a1")}, 14 {new File(in.getPath() + ".b0"), 15 new File(in.getPath() + ".b1")}}; 16 split(in, files[0]); 17 int runLength = CAPACITY; 18 int i = 0; 19 while (merge(files[i], files[1 - i], runLength)) { 20 i = 1 - i; 21 runLength *= 2; 22 } 23 files[1 - i][0].renameTo(out); 24 for (i = 0; i < 2; i++) { 25 for (int j = 0; j < 2; j++) { 26 files[i][j].delete(); 27 } 28 } 29 } 30 31 /** 32 * Sort the file args[0] and write the output to args[1]. 33 */ 34 public static void main(String[] args) throws IOException { 35 sort(new File(args[0]), new File(args[1])); 36 } 37 38 }

Lines 1215 create four new files. If in is the file poetry.txt, these are poetry.txt.a0, poetry.txt.a1, poetry.txt.b0, and poetry.txt.b1. These files are used during the repeated mergings. On line 16, in is split into the two files in files[0]that is, those with 'a' in their final extension. The bulk of the workthe merginghappens on lines 1922. The invocation of merge() on line 19 both performs one round of merging and returns a boolean telling us whether there is more work to do. The variable i toggles back and forth between 0 and 1, so the output of the first merging goes to files[1], the next merging overwrites files[0] (which are no longer needed), the next merging uses files[1] again, and so on. Only four temporary files are needed now no matter how many rounds of merging there are. Line 23 renames the first temporary file to contain all of the data. Lines 2428 delete the temporary files.

The split() method (Figure 17-32) uses the same trick with the variable i. On each pass through the loop on lines 1019, the method reads up to CAPACITY lines from input, storing them in a SortableArrayList. This ArrayList is internally sorted on line 15. (In a professional implementation, we would use a faster internal sorting algorithm than insertion sort.) Since the size of the ArrayList is no more than CAPACITY, it's okay to sort it internally. Once sorted, the lines are printed to one of the output files.

Figure 17-32. Splitting a file into two files of sorted runs.

1 /** 2 * Split in into runs of maximum length CAPACITY and write them 3 * to out[0] and out[1]. 4 */ 5 protected static void split(File in, File[] out) 6 throws IOException { 7 Scanner input = new Scanner(new FileInputStream(in)); 8 PrintWriter[] output = {new PrintWriter(out[0]), 9 new PrintWriter(out[1])}; 10 int i = 0; 11 while (input.hasNext()) { 12 SortableArrayList run = new SortableArrayList(); 13 for (int j = 0; (input.hasNext()) && (j < CAPACITY); j++) { 14 run.add(input.nextLine()); 15 } 16 run.insertionSort(); 17 for (String s : run) { 18 output[i].println(s); 19 } 20 i = 1 - i; 21 } 22 output[0].close(); 23 output[1].close(); 24 }

The merge() method (Figure 17-33) is similar to the one from the internal merge sort algorithm (Section 9.3), but it uses the ExternalSortRun class. This class encapsulates certain details, including the situation in which the last run in a file contains less than runLength lines.

Figure 17-33. The core of the merge() method is very similar to the one from the internal merge sort algorithm discussed in Section 9.3.

1 /** 2 * Merge runs, of maximum length runLength, in the files in[0] and 3 * in[1], into runs twice this length in out[0] and out[1]. 4 * Return true if both output files are needed. 5 */ 6 protected static boolean merge(File[] in, File[] out, int runLength) 7 throws IOException { 8 boolean bothOutputsUsed = false; 9 Scanner[] input = {new Scanner(new FileInputStream(in[0])), 10 new Scanner(new FileInputStream(in[1]))}; 11 PrintWriter[] output = {new PrintWriter(out[0]), 12 new PrintWriter(out[1])}; 13 int i = 0; 14 while (input[0].hasNext() || input[1].hasNext()) { 15 ExternalSortRun[] runs 16 = {new ExternalSortRun(input[0], runLength), 17 new ExternalSortRun(input[1], runLength)}; 18 if (i == 1) { 19 bothOutputsUsed = true; 20 } 21 while ((runs[0].hasNext()) || (runs[1].hasNext())) { 22 if ((!runs[1].hasNext()) 23 || ((runs[0].hasNext()) 24 && (runs[0].peek().compareTo(runs[1].peek()) < 0))) { 25 output[i].println(runs[0].next()); 26 } else { 27 output[i].println(runs[1].next()); 28 } 29 } 30 i = 1 - i; 31 } 32 output[0].close(); 33 output[1].close(); 34 return bothOutputsUsed; 35 }

There is no external sorting algorithm built into Java, but the UNIX utility sort provides similar functionality.

Exercises

17.9

If there is enough memory, will increasing the length of each run in the initial split make the external sorting algorithm run more quickly? Explain.

17.10

Modify the ExternalSort program so that the sorted data are printed to System.out rather than stored in a file if no second command-line argument is given.

Категории