File Tools
External files are at the heart of much of what we do with shell utilities. For instance, a testing system may read its inputs from one file, store program results in another file, and check expected results by loading yet another file. Even user interface and Internet-oriented programs may load binary images and audio clips from files on the underlying computer. It's a core programming concept.
In Python, the built-in open function is the primary tool scripts use to access the files on the underlying computer system. Since this function is an inherent part of the Python language, you may already be familiar with its basic workings. Technically, open gives direct access to the stdio filesystem calls in the system's C library -- it returns a new file object that is connected to the external file, and has methods that map more or less directly to file calls on your machine. The open function also provides a portable interface to the underlying filesystem -- it works the same on every platform Python runs on.
Other file-related interfaces in Python allow us to do things such as manipulate lower-level descriptor-based files (module os), store objects away in files by key (modules anydbm and shelve), and access SQL databases. Most of these are larger topics addressed in Chapter 16. In this section, we take a brief tutorial look at the built-in file object, and explore a handful of more advanced file-related topics. As usual, you should consult the library manual's file object entry for further details and methods we don't have space to cover here.
2.11.1 Built-in File Objects
For most purposes, the open function is all you need to remember to process files in your scripts. The file object returned by open has methods for reading data (read, readline, readlines), writing data (write, writelines), freeing system resources (close), moving about in the file (seek), forcing data to be transferred out of buffers (flush), fetching the underlying file handle (fileno), and more. Since the built-in file object is so easy to use, though, let's jump right in to a few interactive examples.
2.11.1.1 Output files
To make a new file, call open with two arguments: the external name of the file to be created, and a mode string "w" (short for "write"). To store data on the file, call the file object's write method with a string containing the data to store, and then call the close method to close the file if you wish to open it again within the same program or session:
C: emp>python >>> file = open('data.txt', 'w') # open output file object: creates >>> file.write('Hello file world! ') # writes strings verbatim >>> file.write('Bye file world. ') >>> file.close( ) # closed on gc and exit too
And that's it -- you've just generated a brand new text file on your computer, no matter which computer you type this code on:
C: emp>dir data.txt /B data.txt C: emp>type data.txt Hello file world! Bye file world.
There is nothing unusual about the new file at all; here, I use the DOS dir and type commands to list and display the new file, but it shows up in a file explorer GUI too.
2.11.1.1.1 Opening
In the open function call shown in the preceding example, the first argument can optionally specify a complete directory path as part of the filename string; if we pass just a simple filename without a path, the file will appear in Python's current working directory. That is, it shows up in the place where the code is run -- here, directory C: emp on my machine is implied by the bare filename data.txt, so this really creates a file at C: empdata.txt. See Section 2.7 earlier in this chapter for a refresher on this topic.
Also note that when opening in "w" mode, Python either creates the external file if it does not yet exist, or erases the file's current contents if it is already present on your machine (so be careful out there).
2.11.1.1.2 Writing
Notice that we added an explicit end-of-line character to lines written to the file; unlike the print statement, file write methods write exactly what they are passed, without any extra formatting. The string passed to write shows up byte-for-byte on the external file.
Output files also sport a writelines method, which simply writes all the strings in a list one at a time, without any extra formatting added. For example, here is a writelines equivalent to the two write calls shown earlier:
file.writelines(['Hello file world! ', 'Bye file world. '])
This call isn't as commonly used (and can be emulated with a simple for loop), but is convenient in scripts that save output in a list to be written later.
2.11.1.1.3 Closing
The file close method used earlier finalizes file contents and frees up system resources. For instance, closing forces buffered output data to be flushed out to disk. Normally, files are automatically closed when the file object is garbage collected by the interpreter (i.e., when it is no longer referenced), and when the Python session or program exits. Because of that, close calls are often optional. In fact, it's common to see file-processing code in Python like this:
open('somefile.txt').write("G'day Bruce ")
Since this expression makes a temporary file object, writes to it immediately, and does not save a reference to it, the file object is reclaimed and closed right away without ever having called the close method explicitly.
|
2.11.1.2 Input files
Reading data from external files is just as easy as writing, but there are more methods that let us load data in a variety of modes. Input text files are opened with either a mode flag of "r" (for "read") or no mode flag at all (it defaults to "r" if omitted). Once opened, we can read the lines of a text file with the readlines method:
>>> file = open('data.txt', 'r') # open input file object >>> for line in file.readlines( ): # read into line string list ... print line, # lines have ' ' at end ... Hello file world! Bye file world.
The readlines method loads the entire contents of the file into memory, and gives it to our scripts as a list of line strings that we can step through in a loop. In fact, there are many ways to read an input file:
- file.read( ) returns a string containing all the bytes stored in the file.
- file.read(N) returns a string containing the next N bytes from the file.
- file.readline( ) reads through the next and returns a line string.
- file.readlines( ) reads the entire file and returns a list of line strings.
Let's run these method calls to read files, lines, and bytes:
>>> file.seek(0) # go back to the front of file >>> file.read( ) # read entire file into string 'Hello file world! 12Bye file world. 12' >>> file.seek(0) >>> file.readlines( ) ['Hello file world! 12', 'Bye file world. 12'] >>> file.seek(0) >>> file.readline( ) 'Hello file world! 12' >>> file.readline( ) 'Bye file world. 12' >>> file.seek(0) >>> file.read(1), file.read(8) ('H', 'ello fil')
All these input methods let us be specific about how much to fetch. Here are a few rules of thumb about which to choose:
- read( ) and readlines( ) load the entire file into memory all at once. That makes them handy for grabbing a file's contents with as little code as possible. It also makes them very fast, but costly for huge files -- loading a multi-gigabyte file into memory is not generally a good thing to do.
- On the other hand, because the readline( ) and read(N) calls fetch just part of the file (the next line, or N-byte block), they are safer for potentially big files, but a bit less convenient, and usually much slower. If speed matters and your files aren't huge, read or readlines may be better choices.
By the way, the seek(0) call used repeatedly here means "go back to the start of the file." In files, all read and write operations take place at the current position; files normally start at offset when opened and advance as data is transferred. The seek call simply lets us move to a new position for the next transfer operation. Python's seek method also accepts an optional second argument having one of three values -- 0 for absolute file positioning (the default), 1 to seek relative to the the current position, and 2 to seek relative to the file's end. When seek is passed only an offset argument as above, it's roughly a file rewind operation.
2.11.1.3 Other file object modes
Besides "w" and "r", most platforms support an "a" open mode string, meaning "append." In this output mode, write methods add data to the end of the file, and the open call will not erase the current contents of the file:
>>> file = open('data.txt', 'a') # open in append mode: doesn't erase >>> file.write('The Life of Brian') # added at end of existing data >>> file.close( ) >>> >>> open('data.txt').read( ) # open and read entire file 'Hello file world! 12Bye file world. 12The Life of Brian'
Most files are opened using the sorts of calls we just ran, but open actually allows up to three arguments for more specific processing needs -- the filename, the open mode, and a buffer size. All but the first of these are optional: if omitted, the open mode argument defaults to "r" (input), and the buffer size policy is to enable buffering on most platforms. Here are a few things you should know about all three open arguments:
Filename
As mentioned, filenames can include an explicit directory path to refer to files in arbitrary places on your computer; if they do not, they are taken to be names relative to the current working directory (described earlier). In general, any filename form you can type in your system shell will work in an open call. For instance, a filename argument r'.. empspam.txt' on Windows means spam.txt in the temp subdirectory of the current working directory's parent -- up one, and down to directory temp.
Open mode
The open function accepts other modes too, some of which are not demonstrated in this book (e.g., r+, w+, and a+ to open for updating, and any mode string with a "b" to designate binary mode). For instance, mode r+ means both reads and writes are allowed on the file, and wb writes data in binary mode (more on this in the next section). Generally, whatever you could use as a mode string in the C language's fopen call on your platform will work in the Python open function, since it really just calls fopen internally. (If you don't know C, don't sweat this point.) Notice that the contents of files are always strings in Python programs regardless of mode: read methods return a string, and we pass a string to write methods.
Buffer size
The open call also takes an optional third buffer size argument, which lets you control stdio buffering for the file -- the way that data is queued up before being transferred to boost performance. If passed, means file operations are unbuffered (data is transferred immediately), 1 means they are line buffered, any other positive value means use a buffer of approximately that size, and a negative value means to use the system default (which you get if no third argument is passed, and generally means buffering is enabled). The buffer size argument works on most platforms, but is currently ignored on platforms that don't provide the sevbuf system call.
2.11.1.4 Binary data files
The preceding examples all process simple text files. On most platforms, Python scripts can also open and process files containing binary data -- JPEG images, audio clips, and anything else that can be stored in files. The primary difference in terms of code is the mode argument passed to the built-in open function:
>>> file = open('data.txt', 'wb') # open binary output file >>> file = open('data.txt', 'rb') # open binary input file
Once you've opened binary files in this way, you may read and write their contents using the same methods just illustrated: read, write, and so on. (readline and readlines don't make sense here, though: binary data isn't line-oriented.)
In all cases, data transferred between files and your programs is represented as Python strings within scripts, even if it is binary data. This works because Python string objects can always contain character bytes of any value (though some may look odd if printed). Interestingly, even a byte of value zero can be embedded in a Python string; it's called