File Tools

2017-11-03 09:05:09

External files are at the heart of much of what we do with shell utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It's a core programming concept.

In Python, the built-in open function is the primary tool scripts use to access the files on the underlying computer system. Since this function is an inherent part of the Python language, you may already be familiar with its basic workings. Technically, open gives direct access to the stdio filesystem calls in the system's C library -- it returns a new file object that is connected to the external file, and has methods that map more or less directly to file calls on your machine. The open function also provides a portable interface to the underlying filesystem -- it works the same on every platform Python runs on.

Other file-related interfaces in Python allow us to do things such as manipulate lower-level descriptor-based files (module os), store objects away in files by key (modules anydbm and shelve), and access SQL databases. Most of these are larger topics addressed in Chapter 16. In this section, we take a brief tutorial look at the built-in file object, and explore a handful of more advanced file-related topics. As usual, you should consult the library manual's file object entry for further details and methods we don't have space to cover here.

2.11.1 Built-in File Objects

For most purposes, the open function is all you need to remember to process files in your scripts. The file object returned by open has methods for reading data (read, readline, readlines), writing data (write, writelines), freeing system resources (close), moving about in the file (seek), forcing data to be transferred out of buffers (flush), fetching the underlying file handle (fileno), and more. Since the built-in file object is so easy to use, though, let's jump right in to a few interactive examples.

2.11.1.1 Output files

To make a new file, call open with two arguments: the external name of the file to be created, and a mode string "w" (short for "write"). To store data on the file, call the file object's write method with a string containing the data to store, and then call the close method to close the file if you wish to open it again within the same program or session:

    C:\temp>python
    >>> file = open('data.txt', 'w')         # open output file object: creates
    >>> file.write('Hello file world!\n')    # writes strings verbatim
    >>> file.write('Bye file world.\n')
    >>> file.close()                         # closed on gc and exit too

And that's it -- you've just generated a brand new text file on your computer, no matter which computer you type this code on:

    C:\temp>dir data.txt /B
    data.txt

    C:\temp>type data.txt
    Hello file world!
    Bye file world.

There is nothing unusual about the new file at all; here, I use the DOS dir and type commands to list and display the new file, but it shows up in a file explorer GUI too.
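For readers on a current interpreter, the same steps can be sketched as follows. This is only an illustration under stated assumptions: it uses Python 3 syntax and the with statement, neither of which appears in the session above, and it writes to a temporary directory (the path name is hypothetical) instead of C:\temp.

```python
import os
import tempfile

# Hypothetical scratch location; the session above used C:\temp instead.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

with open(path, 'w') as file:            # 'w' creates (or truncates) the file
    file.write('Hello file world!\n')    # write() adds no newline of its own
    file.write('Bye file world.\n')
# leaving the with block closes the file

with open(path) as file:                 # mode defaults to 'r'
    text = file.read()

print(text, end='')
```

The with block plays the role of the explicit close call: the file is closed when the block exits, so it can be reopened safely right afterward.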

2.11.1.1.1 Opening

In the open function call shown in the preceding example, the first argument can optionally specify a complete directory path as part of the filename string; if we pass just a simple filename without a path, the file will appear in Python's current working directory. That is, it shows up in the place where the code is run -- here, directory C:\temp on my machine is implied by the bare filename data.txt, so this really creates a file at C:\temp\data.txt. See Section 2.7 earlier in this chapter for a refresher on this topic.

Also note that when opening in "w" mode, Python either creates the external file if it does not yet exist, or erases the file's current contents if it is already present on your machine (so be careful out there).

2.11.1.1.2 Writing

Notice that we added an explicit \n end-of-line character to lines written to the file; unlike the print statement, file write methods write exactly what they are passed, without any extra formatting. The string passed to write shows up byte-for-byte on the external file.

Output files also sport a writelines method, which simply writes all the strings in a list one at a time, without any extra formatting added. For example, here is a writelines equivalent to the two write calls shown earlier:

    file.writelines(['Hello file world!\n', 'Bye file world.\n'])

This call isn't as commonly used (and can be emulated with a simple for loop), but is convenient in scripts that save output in a list to be written later.
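The equivalence between writelines and a simple loop can be checked directly. The sketch below assumes Python 3 and uses io.StringIO as an in-memory stand-in for a real output file; that substitution is made here only to keep the comparison self-contained and is not part of the original example.

```python
import io

lines = ['Hello file world!\n', 'Bye file world.\n']

# io.StringIO is an in-memory stand-in for a real output file (an assumption
# made here so the comparison leaves nothing on disk).
via_writelines = io.StringIO()
via_writelines.writelines(lines)         # one call, no separators added

via_loop = io.StringIO()
for line in lines:                       # the equivalent explicit loop
    via_loop.write(line)

print(via_writelines.getvalue() == via_loop.getvalue())
```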

2.11.1.1.3 Closing

The file close method used earlier finalizes file contents and frees up system resources. For instance, closing forces buffered output data to be flushed out to disk. Normally, files are automatically closed when the file object is garbage collected by the interpreter (i.e., when it is no longer referenced), and when the Python session or program exits. Because of that, close calls are often optional. In fact, it's common to see file-processing code in Python like this:

    open('somefile.txt', 'w').write("G'day Bruce\n")

Since this expression makes a temporary file object, writes to it immediately, and does not save a reference to it, the file object is reclaimed and closed right away without ever having called the close method explicitly.

But note that this auto-close-on-reclaim behavior could change in future Python releases. Moreover, the JPython Java-based Python implementation discussed later does not reclaim files as immediately as the standard Python system (it uses Java's garbage collector). If your script makes many files and your platform limits the number of open files per program, explicit close calls are a robust habit to form.
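One way to make the close explicit without relying on garbage collection is sketched below. It assumes Python 3; the try/finally form works in older releases too, while the with statement is a later language addition not covered in this section. The file name and temporary directory are hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'somefile.txt')   # hypothetical scratch file

file = open(path, 'w')
try:
    file.write("G'day Bruce\n")
finally:
    file.close()                 # runs even if write() raises

print(file.closed)               # True

with open(path, 'w') as file2:   # the with statement does the same bookkeeping
    file2.write("G'day Bruce\n")

print(file2.closed)              # True
```

Either form behaves the same on interpreters that reclaim objects lazily, which is the portability concern raised above.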

2.11.1.2 Input files

Reading data from external files is just as easy as writing, but there are more methods that let us load data in a variety of modes. Input text files are opened with either a mode flag of "r" (for "read") or no mode flag at all (it defaults to "r" if omitted). Once opened, we can read the lines of a text file with the readlines method:

    >>> file = open('data.txt', 'r')       # open input file object
    >>> for line in file.readlines():      # read into line string list
    ...     print line,                    # lines have '\n' at end
    ...
    Hello file world!
    Bye file world.

The readlines method loads the entire contents of the file into memory, and gives it to our scripts as a list of line strings that we can step through in a loop. In fact, there are many ways to read an input file:

file.read( ) returns a string containing all the bytes stored in the file.

file.read(N) returns a string containing the next N bytes from the file.

file.readline( ) reads through the next \n and returns a line string.

file.readlines( ) reads the entire file and returns a list of line strings.

Let's run these method calls to read files, lines, and bytes:

    >>> file.seek(0)                       # go back to the front of file
    >>> file.read()                        # read entire file into string
    'Hello file world!\012Bye file world.\012'
    >>> file.seek(0)
    >>> file.readlines()
    ['Hello file world!\012', 'Bye file world.\012']
    >>> file.seek(0)
    >>> file.readline()
    'Hello file world!\012'
    >>> file.readline()
    'Bye file world.\012'
    >>> file.seek(0)
    >>> file.read(1), file.read(8)
    ('H', 'ello fil')

All these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:

read( ) and readlines( ) load the entire file into memory all at once. That makes them handy for grabbing a file's contents with as little code as possible. It also makes them very fast, but costly for huge files -- loading a multi-gigabyte file into memory is not generally a good thing to do.

On the other hand, because the readline( ) and read(N) calls fetch just part of the file (the next line, or N-byte block), they are safer for potentially big files, but a bit less convenient, and usually much slower. If speed matters and your files aren't huge, read or readlines may be better choices.
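As an illustration of the read(N) style, here is a minimal sketch of a block-at-a-time reader. It assumes Python 3, opens the file in binary mode so that N counts bytes, and uses a hypothetical helper name (read_in_blocks) and a deliberately tiny block size so the effect is visible.

```python
import os
import tempfile

def read_in_blocks(path, blocksize=4):
    """Yield successive blocks of at most blocksize bytes from a file."""
    with open(path, 'rb') as file:
        while True:
            block = file.read(blocksize)     # next N bytes, or fewer at the end
            if not block:                    # empty result means end-of-file
                break
            yield block

# Hypothetical demo data
path = os.path.join(tempfile.mkdtemp(), 'data.bin')
with open(path, 'wb') as out:
    out.write(b'Hello file world!\n')

blocks = list(read_in_blocks(path))
print(blocks)
```

Only one block is held in memory at a time, which is what makes this pattern safe for files of any size.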

By the way, the seek(0) call used repeatedly here means "go back to the start of the file." In files, all read and write operations take place at the current position; files normally start at offset 0 when opened and advance as data is transferred. The seek call simply lets us move to a new position for the next transfer operation. Python's seek method also accepts an optional second argument having one of three values -- 0 for absolute file positioning (the default), 1 to seek relative to the current position, and 2 to seek relative to the file's end. When seek is passed only an offset of 0, as above, it's roughly a file rewind operation.
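The three positioning modes can be tried out with a short sketch. It assumes Python 3 and a binary-mode file (text-mode files in Python 3 restrict relative seeks), and it uses the os.SEEK_CUR and os.SEEK_END names for the values 1 and 2; the scratch file is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'data.bin')   # hypothetical scratch file
with open(path, 'wb') as out:
    out.write(b'Hello file world!\n')                 # 18 bytes

with open(path, 'rb') as file:
    first = file.read(5)                 # b'Hello'; position is now 5
    file.seek(1, os.SEEK_CUR)            # mode 1: skip one byte from here
    word = file.read(4)                  # b'file'
    file.seek(-7, os.SEEK_END)           # mode 2: 7 bytes before the end
    tail = file.read()                   # b'world!\n'
    file.seek(0)                         # mode 0 (default): rewind
    again = file.read(5)                 # b'Hello'

print(first, word, tail, again)
```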

2.11.1.3 Other file object modes

Besides "w" and "r", most platforms support an "a" open mode string, meaning "append." In this output mode, write methods add data to the end of the file, and the open call will not erase the current contents of the file:

    >>> file = open('data.txt', 'a')          # open in append mode: doesn't erase
    >>> file.write('The Life of Brian')       # added at end of existing data
    >>> file.close()
    >>>
    >>> open('data.txt').read()               # open and read entire file
    'Hello file world!\012Bye file world.\012The Life of Brian'

Most files are opened using the sorts of calls we just ran, but open actually allows up to three arguments for more specific processing needs -- the filename, the open mode, and a buffer size. All but the first of these are optional: if omitted, the open mode argument defaults to "r" (input), and the buffer size policy is to enable buffering on most platforms. Here are a few things you should know about all three open arguments:

Filename

As mentioned, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described earlier). In general, any filename form you can type in your system shell will work in an open call. For instance, a filename argument r'..\temp\spam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory's parent -- up one, and down to directory temp.

Open mode

The open function accepts other modes too, some of which are not demonstrated in this book (e.g., r+, w+, and a+ to open for updating, and any mode string with a "b" to designate binary mode). For instance, mode r+ means both reads and writes are allowed on the file, and wb writes data in binary mode (more on this in the next section). Generally, whatever you could use as a mode string in the C language's fopen call on your platform will work in the Python open function, since it really just calls fopen internally. (If you don't know C, don't sweat this point.) Notice that the contents of files are always strings in Python programs regardless of mode: read methods return a string, and we pass a string to write methods.

Buffer size

The open call also takes an optional third buffer size argument, which lets you control stdio buffering for the file -- the way that data is queued up before being transferred to boost performance. If passed, 0 means file operations are unbuffered (data is transferred immediately), 1 means they are line buffered, any other positive value means use a buffer of approximately that size, and a negative value means to use the system default (which you get if no third argument is passed, and generally means buffering is enabled). The buffer size argument works on most platforms, but is currently ignored on platforms that don't provide the setvbuf system call.
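The effect of the buffer size argument can be observed with a small experiment. This sketch assumes Python 3, where a buffer size of 0 is accepted only for binary-mode files, and it checks the file's size on disk through os.path.getsize; the scratch file name is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'buffered.bin')   # hypothetical scratch file

# Buffer size 0: each write goes straight to the operating system.
raw = open(path, 'wb', 0)
raw.write(b'spam')
size_unbuffered = os.path.getsize(path)      # already 4, before any close
raw.close()

# Default buffering: flush() (or close()) pushes queued data out.
buffered = open(path, 'wb')
buffered.write(b'eggs!')
buffered.flush()
size_after_flush = os.path.getsize(path)     # 5 once flushed
buffered.close()

print(size_unbuffered, size_after_flush)
```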

2.11.1.4 Binary data files

The preceding examples all process simple text files. On most platforms, Python scripts can also open and process files containing binary data -- JPEG images, audio clips, and anything else that can be stored in files. The primary difference in terms of code is the mode argument passed to the built-in open function:

    >>> file = open('data.txt', 'wb')         # open binary output file
    >>> file = open('data.txt', 'rb')         # open binary input file

Once you've opened binary files in this way, you may read and write their contents using the same methods just illustrated: read, write, and so on. (readline and readlines don't make sense here, though: binary data isn't line-oriented.)

In all cases, data transferred between files and your programs is represented as Python strings within scripts, even if it is binary data. This works because Python string objects can always contain character bytes of any value (though some may look odd if printed). Interestingly, even a byte of value zero can be embedded in a Python string; it's called \0 in escape-code notation, and does not terminate strings in Python as it does in C. For instance:

    >>> data = "a\0b\0c"
    >>> data
    'a\000b\000c'
    >>> len(data)
    5

Instead of relying on a terminator character, Python keeps track of a string's length explicitly. Here, data references a string of length 5 that happens to contain two zero-value bytes; they print in octal escape form as \000. Because no character codes are reserved, it's okay to read binary data with zero bytes (and other values) into a string in Python.
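The same point can be sketched for a current interpreter. Note the assumptions: in Python 3, binary data is held in bytes objects instead of ordinary strings, and zero bytes display in hex form (\x00) instead of octal; the scratch file is hypothetical.

```python
import os
import tempfile

data = b'a\0b\0c'                  # bytes literal with two zero-value bytes
print(len(data))                   # 5
print(data)                        # b'a\x00b\x00c' (hex escapes in Python 3)

path = os.path.join(tempfile.mkdtemp(), 'zeros.bin')   # hypothetical scratch file
with open(path, 'wb') as out:
    out.write(data)
with open(path, 'rb') as inp:
    loaded = inp.read()

print(loaded == data)              # True: zero bytes survive the round trip
```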

2.11.1.5 End-of-line translations on Windows

Strictly speaking, on some platforms you may not need the "b" at the end of the open mode argument to process binary files; the "b" is simply ignored, so modes "r" and "w" work just as well. In fact, the "b" in mode flag strings is usually only required for binary files on Windows. To understand why, though, you need to know how lines are terminated in text files.

For historical reasons, the end of a line of text in a file is represented by different characters on different platforms: it's a single \n character on Unix and Linux, but the two-character sequence \r\n on Windows.[9] That's why files moved between Linux and Windows may look odd in your text editor after transfer -- they may still be stored using the original platform's end-of-line convention. For example, most Windows editors handle text in Unix format, but Notepad is a notable exception -- text files copied from Unix or Linux usually look like one long line when viewed in Notepad, with strange characters inside (\n).

[9] Actually, it gets worse: on the Mac, lines in text files are terminated with a single \r (not \n or \r\n). Whoever said proprietary software was good for the consumer probably wasn't speaking about users of multiple platforms, and certainly wasn't talking about programmers.

Python scripts don't normally need to care, because the Windows port (really, the underlying C compiler on Windows) automatically maps the DOS \r\n sequence to a single \n. It works like this -- when scripts are run on Windows:

For files opened in text mode, \r\n is translated to \n when input.

For files opened in text mode, \n is translated to \r\n when output.

For files opened in binary mode, no translation occurs on input or output.

On Unix-like platforms, no translations occur, regardless of open modes.

There are two important consequences of all these rules to keep in mind. First, the end of line character is almost always represented as a single \n in all Python scripts, regardless of how it is stored in external files on the underlying platform. By mapping to and from \r\n on input and output, the Windows port hides the platform-specific difference.

The second consequence of the mapping is more subtle: if you mean to process binary data files on Windows, you generally must be careful to open those files in binary mode ("rb", "wb"), not text mode ("r", "w"). Otherwise, the translations listed previously could very well corrupt data as it is input or output. Binary data may well contain, by chance, bytes with the same values as the DOS end-line characters, \r and \n. If you process such binary files in text mode on Windows, \r bytes may be incorrectly discarded when read, and \n bytes may be erroneously expanded to \r\n when written. The net effect is that your binary data will be trashed when read and written -- probably not quite what you want! For example, on Windows:

    >>> len('a\0b\rc\r\nd')                            # 4 escape code bytes
    8
    >>> open('temp.bin', 'wb').write('a\0b\rc\r\nd')   # write binary data to file
    >>> open('temp.bin', 'rb').read()                  # intact if read as binary
    'a\000b\015c\015\012d'
    >>> open('temp.bin', 'r').read()                   # loses a \r in text mode!
    'a\000b\015c\012d'
    >>> open('temp.bin', 'w').write('a\0b\rc\r\nd')    # adds a \r in text mode!
    >>> open('temp.bin', 'rb').read()
    'a\000b\015c\015\015\012d'

This is only an issue when running on Windows, but using binary open modes "rb" and "wb" for binary files everywhere won't hurt on other platforms, and will help make your scripts more portable (you never know when a Unix utility may wind up seeing action on your PC).

There are other times you may want to use binary file open modes too. For instance, in Chapter 5, we'll meet a script called fixeoln_one that translates between DOS and Unix end-of-line character conventions in text files. Such a script also has to open text files in binary mode to see what end-of-line characters are truly present on the file; in text mode, they would already be translated to \n by the time they reached the script.
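The translation rules can be reproduced on any platform with a current interpreter. The sketch below rests on an assumption that differs from the text: in Python 3 the mapping is done by Python's own I/O layer and can be requested with the newline argument of open, so the DOS convention is forced here explicitly and does not depend on running on Windows. The final line is only a hint at what an end-of-line converter must do; it is not the fixeoln_one script itself.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'eoln.txt')   # hypothetical scratch file

# newline='\r\n' requests the DOS convention on output on any platform.
with open(path, 'w', newline='\r\n') as out:
    out.write('Hello\nBye\n')

with open(path, 'rb') as inp:          # binary mode: bytes exactly as stored
    stored = inp.read()

with open(path, 'r') as inp:           # text mode: line ends mapped to '\n'
    seen = inp.read()

# An end-of-line converter has to work on the raw bytes.
as_unix = stored.replace(b'\r\n', b'\n')

print(stored, repr(seen), as_unix)
```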

2.11.2 File Tools in the os Module

The os module contains an additional set of file-processing functions that are distinct from the built-in file object tools demonstrated in previous examples. For instance, here is a very partial list of os file-related calls:

os.open(path, flags, mode)

Opens a file, returns its descriptor

os.read(descriptor, N)

Reads at most N bytes, returns a string

os.write(descriptor, string)

Writes bytes in string to the file

os.lseek(descriptor, position, how)

Moves to position in the file; how takes the same 0, 1, or 2 values as the second argument of the file object's seek method

Technically, os calls process files by their descriptors -- integer codes or "handles" that identify files in the operating system. Because the descriptor-based file tools in os are lower-level and more complex than the built-in file objects created with the built-in open function, you should generally use the latter for all but very special file-processing needs.[10]

[10] For instance, to process pipes, described in Chapter 3. The Python pipe call returns two file descriptors, which can be processed with os module tools or wrapped in a file object with os.fdopen.

To give you the general flavor of this tool-set, though, let's run a few interactive experiments. Although built-in file objects and os module descriptor files are processed with distinct toolsets, they are in fact related -- the stdio filesystem used by file objects simply adds a layer of logic on top of descriptor-based files.

In fact, the fileno file object method returns the integer descriptor associated with a built-in file object. For instance, the standard stream file objects have descriptors 0, 1, and 2; calling the os.write function to send data to stdout by descriptor has the same effect as calling the sys.stdout.write method:

    >>> import sys
    >>> for stream in (sys.stdin, sys.stdout, sys.stderr):
    ...     print stream.fileno(),
    ...
    0 1 2
    >>> sys.stdout.write('Hello stdio world\n')       # write via file method
    Hello stdio world
    >>> import os
    >>> os.write(1, 'Hello descriptor world\n')       # write via os module
    Hello descriptor world
    23

Because file objects we open explicitly behave the same way, it's also possible to process a given real external file on the underlying computer, through the built-in open function, tools in module os, or both:

    >>> file = open(r'C:\temp\spam.txt', 'w')       # create external file
    >>> file.write('Hello stdio file\n')            # write via file method
    >>>
    >>> fd = file.fileno()
    >>> print fd
    3
    >>> os.write(fd, 'Hello descriptor file\n')     # write via os module
    22
    >>> file.close()
    >>>
    C:\WINDOWS>type c:\temp\spam.txt                # both writes show up
    Hello descriptor file
    Hello stdio file
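In the session above, the descriptor write shows up first in the file because the file object's own write was still sitting in its buffer when os.write ran. The sketch below, which assumes Python 3, binary mode, and a hypothetical scratch file, adds an explicit flush so that the two writes land in program order.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'spam.txt')   # hypothetical scratch file

file = open(path, 'wb')
file.write(b'Hello stdio file\n')            # queued in the file object's buffer
file.flush()                                 # push it out before using the descriptor
fd = file.fileno()                           # integer descriptor of the same file
count = os.write(fd, b'Hello descriptor file\n')   # returns number of bytes written
file.close()

with open(path, 'rb') as inp:
    content = inp.read()

print(count, content)
```

Mixing the two interfaces on one file is best limited to experiments like this one; in real scripts, pick one tool set per file or flush carefully at each switch.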

2.11.2.1 Open mode flags

So why the extra file tools in os? In short, they give more low-level control over file processing. The built-in open function is easy to use, but is limited by the underlying stdio filesystem that it wraps -- buffering, open modes, and so on, are all per stdio defaults.[11] Module os lets scripts be more specific; for example, the following opens a descriptor-based file in read-write and binary modes, by performing a binary "or" on two mode flags exported by os:

[11] To be fair to the built-in file object, the open function accepts a mode "rb+", which is equivalent to the combined mode flags used here, and can also be made nonbuffered with a buffer size argument. Whenever possible, use open, not os.open.

    >>> fdfile = os.open(r'C:\temp\spam.txt', (os.O_RDWR | os.O_BINARY))
    >>> os.read(fdfile, 20)
    'Hello descriptor fil'
    >>> os.lseek(fdfile, 0, 0)                  # go back to start of file
    0
    >>> os.read(fdfile, 100)                    # binary mode retains "\r\n"
    'Hello descriptor file\015\012Hello stdio file\015\012'
    >>> os.lseek(fdfile, 0, 0)
    0
    >>> os.write(fdfile, 'HELLO')               # overwrite first 5 bytes
    5

On some systems, such open flags let us specify more advanced things like exclusive access (O_EXCL) and nonblocking modes (O_NONBLOCK) when a file is opened. Some of these flags are not portable across platforms (another reason to use built-in file objects most of the time); see the library manual or run a dir(os) call on your machine for an exhaustive list of other open flags available.

We saw earlier how to go from file object to file descriptor with the fileno file method; we can also go the other way -- the os.fdopen call wraps a file descriptor in a file object. Because conversions work both ways, we can generally use either tool set -- file object, or os module:

    >>> objfile = os.fdopen(fdfile)
    >>> objfile.seek(0)
    >>> objfile.read()
    'HELLO descriptor file\015\012Hello stdio file\015\012'
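A portable variation on this descriptor session is sketched here, assuming Python 3. Because os.O_BINARY is defined only on Windows, the sketch looks it up with getattr and falls back to 0 elsewhere; that fallback and the scratch file are choices made for this illustration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'spam.txt')   # hypothetical scratch file
with open(path, 'wb') as out:
    out.write(b'Hello descriptor file\n')

# O_BINARY exists only on Windows; fall back to 0 elsewhere.
flags = os.O_RDWR | getattr(os, 'O_BINARY', 0)
fd = os.open(path, flags)

head = os.read(fd, 5)                 # b'Hello'
os.lseek(fd, 0, os.SEEK_SET)          # back to the start of the file
os.write(fd, b'HELLO')                # overwrite the first 5 bytes in place

objfile = os.fdopen(fd, 'rb+')        # wrap the descriptor in a file object
objfile.seek(0)
whole = objfile.read()                # b'HELLO descriptor file\n'
objfile.close()                       # also closes the underlying descriptor

print(head, whole)
```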

2.11.2.2 Other os file tools

The os module also includes an assortment of file tools that accept a file pathname string, and accomplish file-related tasks such as renaming (os.rename), deleting (os.remove), and changing the file's owner and permission settings (os.chown, os.chmod). Let's step through a few examples of these tools in action:

>>> os.chmod('spam.txt', 0777) # enabled all accesses

This os.chmod file permissions call passes a nine-bit bitstring, composed of three sets of three bits each. From left to right, the three sets represent the file's owning user, the file's group, and all others. Within each set, the three bits reflect read, write, and execute access permissions. When a bit is "1" in this string, it means that the corresponding operation is allowed for that class of user. For instance, octal 0777 is a string of nine "1" bits in binary, so it enables all three kinds of accesses, for all three user groups; octal 0600 means that the file can be only read and written by the user that owns it (when written in binary, 0600 octal is really bits 110 000 000).

This scheme stems from Unix file permission settings, but works on Windows as well. If it's puzzling, either check a Unix manpage for chmod, or see the fixreadonly example in Chapter 5, for a practical application (it makes read-only files copied off a CD-ROM writable).
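The bit arithmetic behind these permission codes is easy to verify. This sketch assumes Python 3, where octal literals are spelled 0o777 instead of 0777; the describe helper is a hypothetical name introduced only for this illustration.

```python
import stat

all_access = 0o777                    # Python 3 spelling of octal 0777
owner_rw = 0o600

print(format(all_access, '09b'))      # 111111111
print(format(owner_rw, '09b'))        # 110000000

# The stat module names the individual bits.
same_as_0600 = stat.S_IRUSR | stat.S_IWUSR

def describe(mode):
    """Render nine permission bits in the style of ls -l, e.g. 'rw-------'."""
    letters = 'rwxrwxrwx'
    bits = format(mode & 0o777, '09b')
    return ''.join(l if b == '1' else '-' for l, b in zip(letters, bits))

print(describe(0o640))                # rw-r-----
```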

    >>> os.rename(r'C:\temp\spam.txt', r'C:\temp\eggs.txt')      # (from, to)
    >>>
    >>> os.remove(r'C:\temp\spam.txt')                           # delete file
    Traceback (innermost last):
      File "<stdin>", line 1, in ?
    OSError: [Errno 2] No such file or directory: 'C:\\temp\\spam.txt'
    >>>
    >>> os.remove(r'C:\temp\eggs.txt')

The os.rename call used here changes a file's name; the os.remove file deletion call deletes a file from your system, and is synonymous with os.unlink; the latter reflects the call's name on Unix, but was obscure to users of other platforms. The os module also exports the stat system call:

    >>> import os
    >>> info = os.stat(r'C:\temp\spam.txt')
    >>> info
    (33206, 0, 2, 1, 0, 0, 41, 968133600, 968176258, 968176193)
    >>> import stat
    >>> info[stat.ST_MODE], info[stat.ST_SIZE]
    (33206, 41)
    >>> mode = info[stat.ST_MODE]
    >>> stat.S_ISDIR(mode), stat.S_ISREG(mode)
    (0, 1)

The os.stat call returns a tuple of values giving low-level information about the named file, and the stat module exports constants and functions for querying this information in a portable way. For instance, indexing an os.stat result on offset stat.ST_SIZE returns the file's size, and calling stat.S_ISDIR with the mode item from an os.stat result checks whether the file is a directory. As shown earlier, though, both of these operations are available in the os.path module too, so it's rarely necessary to use os.stat except for low-level file queries:

    >>> path = r'C:\temp\spam.txt'
    >>> os.path.isdir(path), os.path.isfile(path), os.path.getsize(path)
    (0, 1, 41)
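For comparison, here is a sketch of the same queries under Python 3, which is an assumption beyond the text: there, os.stat returns an object with named attributes such as st_size and st_mode (tuple-style indexing still works), and the tests return True and False instead of 1 and 0. The 41-byte scratch file is hypothetical and merely mirrors the size in the session above.

```python
import os
import stat
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'spam.txt')   # hypothetical scratch file
with open(path, 'wb') as out:
    out.write(b'x' * 41)                     # 41 bytes, matching the size shown above

info = os.stat(path)
print(info.st_size)                          # 41
print(info[stat.ST_SIZE] == info.st_size)    # tuple-style indexing still works
print(stat.S_ISDIR(info.st_mode), stat.S_ISREG(info.st_mode))   # False True
print(os.path.getsize(path), os.path.isfile(path))              # 41 True
```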

2.11.3 File Scanners

Unlike some shell-tool languages, Python doesn't have an implicit file-scanning loop procedure, but it's simple to write a general one that we can reuse for all time. The module in Example 2-11 defines a general file-scanning routine, which simply applies a passed-in Python function to each line in an external file.

Example 2-11. PP2E\System\Filetools\scanfile.py

    def scanner(name, function):
        file = open(name, 'r')               # create a file object
        while 1:
            line = file.readline()           # call file methods
            if not line: break               # until end-of-file
            function(line)                   # call a function object
        file.close()

The scanner function doesn't care what line-processing function is passed in, and that accounts for most of its generality -- it is happy to apply any single-argument function that exists now or in the future to all the lines in a text file. If we code this module and put it in a directory on PYTHONPATH, we can use it any time we need to step through a file line-by-line. Example 2-12 is a client script that does simple line translations.

Example 2-12. PP2E\System\Filetools\commands.py

    #!/usr/local/bin/python
    from sys import argv
    from scanfile import scanner

    def processLine(line):                      # define a function
        if line[0] == '*':                      # applied to each line
            print "Ms.", line[1:-1]
        elif line[0] == '+':
            print "Mr.", line[1:-1]             # strip 1st and last char
        else:
            raise 'unknown command', line       # raise an exception

    filename = 'data.txt'
    if len(argv) == 2: filename = argv[1]       # allow file name cmd arg
    scanner(filename, processLine)              # start the scanner

If, for no readily obvious reason, the text file hillbillies.txt contains the following lines:

    *Granny
    +Jethro
    *Elly-Mae
    +"Uncle Jed"

then our commands script could be run as follows:

    C:\...\PP2E\System\Filetools>python commands.py hillbillies.txt
    Ms. Granny
    Mr. Jethro
    Ms. Elly-Mae
    Mr. "Uncle Jed"

As a rule of thumb, though, we can usually speed things up by shifting processing from Python code to built-in tools. For instance, if we're concerned with speed (and memory space isn't tight), we can make our file scanner faster by using the readlines method to load the file into a list all at once, instead of the manual readline loop in Example 2-11:

    def scanner(name, function):
        file = open(name, 'r')               # create a file object
        for line in file.readlines():        # get all lines at once
            function(line)                   # call a function object
        file.close()

And if we have a list of lines, we can work more magic with the map built-in function. Here's a minimalist's version; the for loop is replaced by map, and we let Python close the file for us when it is garbage-collected (or the script exits):

    def scanner(name, function):
        map(function, open(name, 'r').readlines())
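A version of the scanner for a current interpreter might look like the sketch below. It assumes Python 3, where file objects can be stepped through directly in a for loop and where map is lazy, so the map-based one-liner above would call nothing unless its result were consumed. The demo input file is hypothetical.

```python
import os
import tempfile

def scanner(name, function):
    """Apply function to each line of the text file called name."""
    with open(name, 'r') as file:        # closed on exit from the block
        for line in file:                # file objects yield one line at a time
            function(line)

# Hypothetical demo input, same shape as hillbillies.txt
path = os.path.join(tempfile.mkdtemp(), 'hillbillies.txt')
with open(path, 'w') as out:
    out.write('*Granny\n+Jethro\n')

seen = []
scanner(path, seen.append)
print(seen)                              # ['*Granny\n', '+Jethro\n']
```

Iterating over the file reads one line at a time, so this form avoids the memory cost of readlines while staying as short as the list-based version.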

But what if we also want to change a file while scanning it? Example 2-13 shows two approaches: one uses explicit files, and the other uses the standard input/output streams to allow for redirection on the command line.

Example 2-13. PP2E\System\Filetools\filters.py

    def filter_files(name, function):         # filter file through function
        input  = open(name, 'r')              # create file objects
        output = open(name + '.out', 'w')     # explicit output file too
        for line in input.readlines():
            output.write(function(line))      # write the modified line
        input.close()
        output.close()                        # output has a '.out' suffix

    def filter_stream(function):
        import sys                            # no explicit files
        while 1:                              # use standard streams
            line = sys.stdin.readline()       # or: raw_input()
            if not line: break
            print function(line),             # or: sys.stdout.write()

    if __name__ == '__main__':
        filter_stream(lambda line: line)      # copy stdin to stdout if run

Since the standard streams are preopened for us, they're often easier to use. This module is more useful when imported as a library (clients provide the line-processing function); when run standalone it simply parrots stdin to stdout:

    C:\...\PP2E\System\Filetools>python filters.py < ..\System.txt
    This directory contains operating system interface examples. Many of the examples in this unit appear elsewhere in the examples distribution tree, because they are actually used to manage other programs. See the README.txt files in the subdirectories here for pointers.
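When the module is imported as a library, the client supplies the line-processing function. The sketch below shows one such use of a filter_files variant, assuming Python 3 and the with statement; the input file and the choice of str.upper as the filter function are hypothetical.

```python
import os
import tempfile

def filter_files(name, function):
    """Write function(line) for each line of name to name + '.out'."""
    with open(name, 'r') as infile, open(name + '.out', 'w') as outfile:
        for line in infile:
            outfile.write(function(line))

path = os.path.join(tempfile.mkdtemp(), 'hillbillies.txt')   # hypothetical input
with open(path, 'w') as out:
    out.write('*Granny\n+Jethro\n')

filter_files(path, str.upper)            # any function from str to str will do

with open(path + '.out') as result:
    filtered = result.read()
print(filtered, end='')
```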

Brutally observant readers may notice that this last file is named filters.py (with an "s"), not filter.py. I originally named it the latter, but changed its name when I realized that a simple import of the filename (e.g., "import filter") assigns the module to a local name "filter," thereby hiding the built-in filter function. This is a built-in functional programming tool, not used very often in typical scripts; but be careful to avoid picking built-in names for module files. I will if you will.

2.11.4 Making Files Look Like Lists

One last file-related trick has proven popular enough to merit an introduction here. Although file objects only export method calls (e.g., file.read( )), it is easy to use classes to make them look more like data structures, and hide some of the underlying file call details. The module in Example 2-14 defines a FileList object that "wraps" a real file to add sequential indexing support.

Example 2-14. PP2E\System\Filetools\filelist.py

    class FileList:
        def __init__(self, filename):
            self.file = open(filename, 'r')     # open and save file
        def __getitem__(self, i):               # overload indexing
            line = self.file.readline()
            if line:
                return line                     # return the next line
            else:
                raise IndexError                # end 'for' loops, 'in'
        def __getattr__(self, name):
            return getattr(self.file, name)     # other attrs from real file

This class defines three specially named methods:

The __init__ method is called whenever a new object is created.

The __getitem__ method intercepts indexing operations.

The __getattr__ method handles undefined attribute references.

This class mostly just extends the built-in file object to add indexing. Most standard file method calls are simply delegated (passed off) to the wrapped file by __getattr__. Each time a FileList object is indexed, though, its __getitem__ method returns the next line in the actual file. Since for loops work by repeatedly indexing objects, this class lets us iterate over a wrapped file as though it were an in-memory list:

    >>> from filelist import FileList
    >>> for line in FileList('hillbillies.txt'):
    ...     print '>', line,
    ...
    > *Granny
    > +Jethro
    > *Elly-Mae
    > +"Uncle Jed"
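The indexing-based iteration that FileList relies on still works on current interpreters, as this sketch suggests. It assumes Python 3, where a for loop falls back to calling __getitem__ with 0, 1, 2, and so on when a class defines no __iter__ method; the demo input file is hypothetical.

```python
import os
import tempfile

class FileList:
    def __init__(self, filename):
        self.file = open(filename, 'r')
    def __getitem__(self, i):            # i is ignored: always the next line
        line = self.file.readline()
        if line:
            return line
        raise IndexError(i)              # tells the for loop to stop
    def __getattr__(self, name):         # delegate everything else
        return getattr(self.file, name)

path = os.path.join(tempfile.mkdtemp(), 'hillbillies.txt')   # hypothetical input
with open(path, 'w') as out:
    out.write('*Granny\n+Jethro\n')

wrapped = FileList(path)
lines = [line for line in wrapped]       # falls back to __getitem__(0), (1), ...
wrapped.close()                          # reaches the real file via __getattr__
print(lines)
```

Because the index is ignored, the object behaves like a one-pass sequence: indexing it out of order would still return whatever line comes next in the file.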

This class could be made much more sophisticated and list-like too. For instance, we might overload the + operation to concatenate a file onto the end of an output file, allow random indexing operations that seek among the file's lines to resolve the specified offset, and so on. But since coding all such extensions takes more space than I have available here, I'll leave them as suggested exercises.
