Comparing Directory Trees
Engineers can be a paranoid sort (but you didn't hear that from me). At least I am. It comes from decades of seeing things go terribly wrong, I suppose. When I create a CD backup of my hard drive, for instance, there's still something a bit too magical about the process to trust the CD writer program to do the right thing. Maybe I should, but it's tough to have a lot of faith in tools that occasionally trash files, and seem to crash my Windows 98 machine every third Tuesday of the month. When push comes to shove, it's nice to be able to verify that data copied to a backup CD is the same as the original -- or at least spot deviations from the original -- as soon as possible. If a backup is ever needed, it will be really needed.
Because data CDs are accessible as simple directory trees, we are once again in the realm of tree walkers -- to verify a backup CD, we simply need to walk its top-level directory. We've already written a generic walker class (the visitor module), but it won't help us here directly: we need to walk two directories in parallel and inspect common files along the way. Moreover, walking either one of the two directories won't allow us to spot files and directories that only exist in the other. Something more custom seems in order here.
5.8.1 Finding Directory Differences
Before we start coding, the first thing we need to clarify is what it means to compare two directory trees. If both trees have exactly the same branch structure and depth, this problem reduces to comparing corresponding files in each tree. In general, though, the trees can have arbitrarily different shapes, depth, and so on.
More generally, the contents of a directory in one tree may have more or fewer entries than the corresponding directory in the other tree. If those differing contents are filenames, there is no corresponding file to compare; if they are directory names, there is no corresponding branch to descend through. In fact, the only way to detect files and directories that appear in one tree but not the other is to detect differences in each level's directory.
In other words, a tree comparison algorithm will also have to perform directory comparisons along the way. Because this is a nested, and simpler operation, let's start by coding a single-directory comparison of filenames in Example 5-24.
Example 5-24. PP2ESystemFiletoolsdirdiff.py
#!/bin/env python ######################################################## # use: python dirdiff.py dir1-path dir2-path # compare two directories to find files that exist # in one but not the other; this version uses the # os.listdir function and list difference; note # that this script only checks filename, not file # contents--see diffall.py for an extension that # does the latter by comparing .read( ) results; ######################################################## import os, sys def reportdiffs(unique1, unique2, dir1, dir2): if not (unique1 or unique2): print 'Directory lists are identical' else: if unique1: print 'Files unique to', dir1 for file in unique1: print '...', file if unique2: print 'Files unique to', dir2 for file in unique2: print '...', file def unique(seq1, seq2): uniques = [] # return items in seq1 only for item in seq1: if item not in seq2: uniques.append(item) return uniques def comparedirs(dir1, dir2): print 'Comparing', dir1, 'to', dir2 files1 = os.listdir(dir1) files2 = os.listdir(dir2) unique1 = unique(files1, files2) unique2 = unique(files2, files1) reportdiffs(unique1, unique2, dir1, dir2) return not (unique1 or unique2) # true if no diffs def getargs( ): try: dir1, dir2 = sys.argv[1:] # 2 command-line args except: print 'Usage: dirdiff.py dir1 dir2' sys.exit(1) else: return (dir1, dir2) if __name__ == '__main__': dir1, dir2 = getargs( ) comparedirs(dir1, dir2)
Given listings of names in two directories, this script simply picks out unique names in the first, unique names in the second, and reports any unique names found as differences (that is, files in one directory but not the other). Its comparedirs function returns a true result if no differences were found -- useful for detecting differences in callers.
Let's run this script on a few directories; differences are detected and reported as names unique in either passed-in directory pathname. Notice that this is only a structural comparison that just checks names in listings, not file contents (we'll add the latter in a moment):
C: emp>python %X%systemfiletoolsdirdiff.py examples cpexamples Comparing examples to cpexamples Directory lists are identical C: emp>python %X%systemfiletoolsdirdiff.py examplesPyTools cpexamplesPyTools Comparing examplesPyTools to cpexamplesPyTools Files unique to examplesPyTools ... visitor.py C: emp>python %X%systemfiletoolsdirdiff.py examplesSystemFiletools cpexamplesSystemFiletools Comparing examplesSystemFiletools to cpexamplesSystemFiletools Files unique to examplesSystemFiletools ... dirdiff2.py Files unique to cpexamplesSystemFiletools ... cpall.py
The unique function is the heart of this script: it performs a simple list difference operation. Here's how it works apart from the rest of this script's code:
>>> L1 = [1, 3, 5, 7, 9] >>> L2 = [2, 3, 6, 8, 9] >>> from dirdiff import unique >>> unique(L1, L2) # items in L1 but not in L2 [1, 5, 7] >>> unique(L2, L1) # items in L2 but not in L1 [2, 6, 8]
These two lists have objects 3 and 9 in common; the rest appear only in one of the two. When applied to directories, unique items represent tree differences, and common items are names of files or subdirectories that merit further comparisons or traversal. There are other ways to check this code; see the dirdiff variants in the book's CD for a few.
5.8.2 Finding Tree Differences
Now all we need is a tree walker that applies dirdiff at each level to pick out unique files and directories, explicitly compares the contents of files in common, and descends through directories in common. Example 5-25 fits the bill.
Example 5-25. PP2ESystemFiletoolsdiffall.py
######################################################### # Usage: "python diffall.py dir1 dir2". # recursive tree comparison--report files that exist # in only dir1 or dir2, report files of same name in # dir1 and dir2 with differing contents, and do the # same for all subdirectories of the same names in # and below dir1 and dir2; summary of diffs appears # at end of output (but search redirected output for # "DIFF" and "unique" strings for further details); ######################################################### import os, dirdiff def intersect(seq1, seq2): commons = [] # items in seq1 and seq2 for item in seq1: if item in seq2: commons.append(item) return commons def comparedirs(dir1, dir2, diffs, verbose=0): # compare filename lists print '-'*20 if not dirdiff.comparedirs(dir1, dir2): diffs.append('unique files at %s - %s' % (dir1, dir2)) print 'Comparing contents' files1 = os.listdir(dir1) files2 = os.listdir(dir2) common = intersect(files1, files2) # compare contents of files in common for file in common: path1 = os.path.join(dir1, file) path2 = os.path.join(dir2, file) if os.path.isfile(path1) and os.path.isfile(path2): bytes1 = open(path1, 'rb').read( ) bytes2 = open(path2, 'rb').read( ) if bytes1 == bytes2: if verbose: print file, 'matches' else: diffs.append('files differ at %s - %s' % (path1, path2)) print file, 'DIFFERS' # recur to compare directories in common for file in common: path1 = os.path.join(dir1, file) path2 = os.path.join(dir2, file) if os.path.isdir(path1) and os.path.isdir(path2): comparedirs(path1, path2, diffs, verbose) if __name__ == '__main__': dir1, dir2 = dirdiff.getargs( ) mydiffs = [] comparedirs(dir1, dir2, mydiffs) # changes mydiffs in-place print '='*40 # walk, report diffs list if not mydiffs: print 'No diffs found.' else: print 'Diffs found:', len(mydiffs) for diff in mydiffs: print '-', diff
At each directory in the tree, this script simply runs the dirdiff tool to detect unique names, and then compares names in common by intersecting directory lists. Since we've already studied the tree-walking tools this script employs, let's jump right into a few example runs. When run on identical trees, status messages scroll during the traversal, and a "No diffs found" message appears at the end:
C: emp>python %X%systemfiletoolsdiffall.py examples cpexamples -------------------- Comparing examples to cpexamples Directory lists are identical Comparing contents -------------------- Comparing examplesold_Part2 to cpexamplesold_Part2 Directory lists are identical Comparing contents -------------------- ...more lines deleted... -------------------- Comparing examplesEmbExtRegist to cpexamplesEmbExtRegist Directory lists are identical Comparing contents -------------------- Comparing examplesPyTools to cpexamplesPyTools Directory lists are identical Comparing contents ======================================== No diffs found.
To show how differences are reported, we need to generate a few. Let's run the global search-and-replace script we met earlier, to change a few files scattered about one of the trees (see -- I told you global replacement could trash your files!):
C: empexamples>python %X%PyToolsvisitor_replace.py -exec SPAM Are you sure?y ... 1355 => .PyToolsvisitor_find_quiet1.py 1356 => .PyToolsfixeoln_one.doc.txt Visited 1356 files Changed 8 files: .package.csh .README-PP2E.txt . eadme-old-pp1E.txt . emp . emp .InternetCgi-Webfixcgi.py .SystemProcessesoutput.txt .PyToolscleanpyc.py
While we're at it, let's remove a few common files so directory uniqueness differences show up on the scope too; the following three removal commands will make two directories differ (the last two commands impact the same directory in different trees):
C: emp>rm cpexamplesPyToolsvisitor.py C: emp>rm cpexamplesSystemFiletoolsdirdiff2.py C: emp>rm examplesSystemFiletoolscpall.py
Now, rerun the comparison walker to pick out differences, and pipe its output report to a file for easy inspection. The following lists just the parts of the output report that identify differences. In typical use, I inspect the summary at the bottom of the report first, and then search for strings "DIFF" and "unique" in the report's text if I need more information about the differences summarized:
C: emp>python %X%systemfiletoolsdiffall.py examples cpexamples > diffs C: emp>type diffs -------------------- Comparing examples to cpexamples Directory lists are identical Comparing contents package.csh DIFFERS README-PP2E.txt DIFFERS readme-old-pp1E.txt DIFFERS temp DIFFERS remp DIFFERS -------------------- Comparing examplesold_Part2 to cpexamplesold_Part2 Directory lists are identical Comparing contents -------------------- ... -------------------- Comparing examplesInternetCgi-Web to cpexamplesInternetCgi-Web Directory lists are identical Comparing contents fixcgi.py DIFFERS -------------------- ... -------------------- Comparing examplesSystemFiletools to cpexamplesSystemFiletools Files unique to examplesSystemFiletools ... dirdiff2.py Files unique to cpexamplesSystemFiletools ... cpall.py Comparing contents -------------------- ... -------------------- Comparing examplesSystemProcesses to cpexamplesSystemProcesses Directory lists are identical Comparing contents output.txt DIFFERS -------------------- ... -------------------- Comparing examplesPyTools to cpexamplesPyTools Files unique to examplesPyTools ... visitor.py Comparing contents cleanpyc.py DIFFERS ======================================== Diffs found: 10 - files differ at examplespackage.csh - cpexamplespackage.csh - files differ at examplesREADME-PP2E.txt - cpexamplesREADME-PP2E.txt - files differ at examples eadme-old-pp1E.txt - cpexamples eadme-old-pp1E.txt - files differ at examples emp - cpexamples emp - files differ at examples emp - cpexamples emp - files differ at examplesInternetCgi-Webfixcgi.py - cpexamplesInternetCgi-Webfixcgi.py - unique files at examplesSystemFiletools - cpexamplesSystemFiletools - files differ at examplesSystemProcessesoutput.txt - cpexamplesSystemProcessesoutput.txt - unique files at examplesPyTools - cpexamplesPyTools - files differ at examplesPyToolscleanpyc.py - cpexamplesPyToolscleanpyc.py
I added line breaks and tabs in a few of these output lines to make them fit on this page, but the report is simple to understand. Ten differences were found -- the eight files we changed (trashed) with the replacement script, and the two directories we threw out of sync with the three rm remove commands.
5.8.2.1 Verifying CD backups
So how does this script placate CD backup paranoia? To double-check my CD writer's work, I run a command like the following. I can also use a command like this to find out what has been changed since the last backup. Again, since the CD is "G:" on my machine when plugged in, I provide a path rooted there; use a root such as /dev/cdrom on Linux:
C: emp>python %X%systemfiletoolsdiffall.py examples g:PP2ndEdexamplesPP2E > exdiffs091500 C: emp>more exdiffs091500 -------------------- Comparing examples to g:PP2ndEdexamplesPP2E Files unique to examples ... .cshrc Comparing contents tounix.py DIFFERS -------------------- Comparing examplesold_Part2 to g:PP2ndEdexamplesPP2Eold_Part2 Directory lists are identical Comparing contents -------------------- ...more visitor_fixeoln.py DIFFERS visitor_fixnames.py DIFFERS ======================================== Diffs found: 41 - unique files at examples - g:PP2ndEdexamplesPP2E - files differ at examples ounix.py - g:PP2ndEdexamplesPP2E ounix.py ...more
The CD spins, the script compares, and a summary of 41 differences appears at the end of the report (in this case, representing changes to the examples tree since the latest backup CD was burned). For an example of a full difference report, see file exdiffs091500 on the book's CD. More typically, this is what turns up for most of my example backups -- files with a leading "." are not copied to the CD:
C: emp>python %X%SystemFiletoolsdiffall.py examples g:PP2ndEdexamplesPP2E ... -------------------- Comparing examplesConfig to g:PP2ndEdexamplesPP2EConfig Files unique to examplesConfig ... .cshrc Comparing contents ======================================== Diffs found: 1 - unique files at examplesConfig - g:PP2ndEdexamplesPP2EConfig
And to really be sure, I run the following global comparison command against the true book directory, to verify the entire book tree backup on CD:
C:>python %X%SystemFiletoolsdiffall.py PP2ndEd G:PP2ndEd -------------------- Comparing PP2ndEd to G:PP2ndEd Files unique to G:PP2ndEd ... examples.tar.gz Comparing contents README.txt DIFFERS -------------------- ...more -------------------- Comparing PP2ndEdexamplesPP2EConfig to G:PP2ndEdexamplesPP2EConfig Files unique to PP2ndEdexamplesPP2EConfig ... .cshrc Comparing contents -------------------- ...more -------------------- Comparing PP2ndEdchapters to G:PP2ndEdchapters Directory lists are identical Comparing contents ch01-intro.doc DIFFERS ch04-os3.doc DIFFERS ch05-gui1.doc DIFFERS ch06-gui2.doc DIFFERS -------------------- ...more ======================================== Diffs found: 11 - unique files at PP2ndEd - G:PP2ndEd - files differ at PP2ndEdREADME.txt - G:PP2ndEdREADME.txt ...more
This particular run indicates that I've changed a "readme" file, four chapter files, and a bunch more since the last backup; if run immediately after making a backup, only the .cshrc unique file shows up on diffall radar. This global comparison can take a few minutes -- it performs byte-for-byte comparisons of all chapter files and screen shots, the examples tree, an image of the book's CD, and more, but it's an accurate and complete verification. Given that this book tree contained roughly 119M of data in 7300 files and 570 directories the last time I checked, a more manual verification procedure without Python's help would be utterly impossible.
Finally, it's worth noting that this script still only detects differences in the tree, but does not give any further details about individual file differences. In fact, it simply loads and compares the binary contents of corresponding files with a single string comparison -- it's a simple yes/no result.[11] If and when I need more details about how two reported files actually differ, I either edit the files, or run the file-comparison command on the host platform (e.g., fc on Windows/DOS, diff or cmp on Unix and Linux). That's not a portable solution for this last step; but for my purposes, just finding the differences in a 1300-file tree was much more critical than reporting which lines differ in files flagged in the report.
[11] We might try to do a bit better here, by opening text files in text mode to ignore line-terminator differences, but it's not clear that such differences should be blindly ignored (what if the caller wants to know if line-end markers have been changed?). We could also be smarter by avoiding the load and compare steps for files that differ in size, and read files in small chunks, instead of all at once, to minimize memory requirements for huge files (see earlier examples such as the cpall script for hints). For my comparisons, such optimizations are unnecessary.
Of course, since we can always run shell commands in Python, this last step could be automated by spawning a diff or fc command with os.popen as differences are encountered (or after the traversal, by scanning the report summary). Because Python excels at processing files and strings, though, it's possible to go one step further and code a Python equivalent of the fc and diff commands. Since this is beyond both this script's scope and this chapter's size limits, that will have to await the attention of a curious reader.