Text Processing in Python
There are a variety of Internet-related modules in the standard library whose specific usage will not be covered here. In the first place, there are two general aspects to writing Internet applications. The first aspect is the parsing, processing, and generation of messages that conform to various protocol requirements. These tasks are solidly inside the realm of text processing and should be covered in this book. The second aspect, however, is the set of issues involved in actually sending a message "over the wire": choosing ports and network protocols, handshaking, validation, and so on. While these tasks are important, they are outside the scope of this book. The synopses below will point you towards appropriate modules, though; the standard documentation, Python interactive help, or other texts can help with the details.

A second issue also comes up. As Internet standards (usually canonicalized in RFCs) have evolved, and as Python libraries have become more versatile and robust, some newer modules have superseded older ones, much as the re module replaced the older regex module. In the interests of backwards compatibility, Python has not dropped any Internet modules from its standard distributions. Nonetheless, the email module represents the current "best practice" for most tasks related to email and newsgroup message handling. The modules mimify, mimetools, MimeWriter, multifile, and rfc822 are likely to be found in existing code, but for new applications it is better to use the capabilities in email instead.

As well as standard library modules, a few third-party tools deserve special mention (at the bottom of this section). A large number of Python developers have created tools for various Internet-related tasks, but only a small number of projects have reached a high degree of sophistication and widespread usage.

5.3.1 Standard Internet-Related Tools
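Before the individual synopses, here is a minimal sketch of the "first aspect" described above: parsing and regenerating a message with the email package, without touching the network at all. The addresses and message text are invented for illustration, and the sketch uses only the package's basic calls (email.message_from_string(), header access by name, get_payload(), and as_string()).

```python
import email

# A made-up RFC-822 style message; a blank line separates headers from body.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Lunch?

Are you free at noon?
"""

# Parsing: turn the raw text into a Message object
msg = email.message_from_string(RAW)
sender = msg["From"]        # headers are looked up by name
subject = msg["Subject"]
body = msg.get_payload()    # body text of a non-multipart message

# Generation: change a header and serialize the message back to text.
# Assigning to a header appends rather than replaces, so delete it first.
del msg["Subject"]
msg["Subject"] = "Re: " + subject
reply_text = msg.as_string()

print(sender)
print(reply_text)
```

Actually delivering reply_text would be a job for a module such as smtplib or nntplib, which is the "over the wire" aspect that the synopses below only point towards.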
asyncore
Asynchronous socket service clients and servers.
Cookie
Manage Web browser cookies. Cookies are a common mechanism for managing state in Web-based applications. RFC-2109 and RFC-2068 describe the encoding used for cookies, but in practice MSIE is not very standards compliant, so the parsing is relaxed in the Cookie module.
SEE ALSO: cgi 376; httplib 396
email.Charset
Work with character set encodings at a fine-tuned level. Other modules within the email package utilize this module to provide higher-level interfaces. If you need to dig deeply into character set conversions, you might want to use this module directly.
SEE ALSO: email 345; email.Header 351; unicode 423; codecs 189
ftplib
Support for implementing custom File Transfer Protocol (FTP) clients. This protocol is detailed in RFC-959. For a full FTP application, ftplib provides a very good starting point; for the simple capability to retrieve publicly accessible files over FTP, urllib.urlopen() is more direct.
SEE ALSO: urllib 388; urllib2 398
gopherlib
Gopher protocol client interface. As much as I am still personally fond of the gopher protocol, it is used so rarely that it is not worth documenting here.
httplib
Support for implementing custom Web clients. This module gives higher-level access to the HTTP and HTTPS protocols than using raw sockets on ports 80 or 443, but it is lower-level, and more communications-oriented, than using urllib to access Web resources in a file-like way.
SEE ALSO: urllib 388; socket 397
ic, icopen
Internet access configuration (Macintosh).
icopen
Internet Config replacement for open() (Macintosh).
imghdr
Recognize image file formats based on their first few bytes.
mailcap
Examine the mailcap file on Unix-like systems. The files /etc/mailcap, /usr/etc/mailcap, /usr/local/etc/mailcap, and $HOME/.mailcap are typically used to configure MIME capabilities in client applications like mail readers and Web browsers (but less so now than a few years ago). See RFC-1524.
mhlib
Interface to MH mailboxes. The MH format consists of a directory structure that mirrors the folder organization of messages. Each message is contained in its own file. While the MH format is in many ways better, the Unix mailbox format seems to be more widely used. Basic access to a single folder in an MH hierarchy can be achieved with the mailbox.MHMailbox class, which satisfies most working requirements.
SEE ALSO: mailbox 372; email 345
mimetools
Various tools used by MIME-reading or MIME-writing programs.
MimeWriter
Generic MIME writer.
mimify
Mimification and unmimification of mail messages.
netrc
Examine the netrc file on Unix-like systems. The file $HOME/.netrc is typically used to configure FTP clients.
SEE ALSO: ftplib 395; urllib 388
nntplib
Support for Network News Transfer Protocol (NNTP) client applications. This protocol is defined in RFC-977. Although Usenet has a different distribution system from email, the message format of NNTP messages still follows the format defined in RFC-822. In particular, the email package (or the rfc822 module) is useful for creating and modifying news messages.
SEE ALSO: email 345; rfc822 397
nsremote
Wrapper around Netscape OSA modules (Macintosh).
rfc822
RFC-822 message manipulation class. The email package is intended to supersede rfc822, and it is better to use email for new application development.
SEE ALSO: email 345; poplib 368; mailbox 372; smtplib 370
select
Wait on I/O completion, such as sockets.
sndhdr
Recognize sound file formats based on their first few bytes.
socket
Low-level interface to BSD sockets. Used to communicate with IP addresses at the level underneath protocols like HTTP, FTP, POP3, Telnet, and so on.
SEE ALSO: ftplib 395; gopherlib 395; httplib 396; imaplib 366; nntplib 397; poplib 368; smtplib 370; telnetlib 397
SocketServer
Asynchronous I/O on sockets. Under Unix, pipes can also be monitored with select. The socket module supports SSL in recent Python versions.
telnetlib
Support for implementing custom telnet clients. This protocol is detailed in RFC-854. While possibly useful for intranet applications, Telnet is an entirely unsecured protocol and should not really be used on the Internet. Secure Shell (SSH) is an encrypted protocol that otherwise is generally similar in capability to Telnet. There is no support for SSH in the Python standard library, but third-party options exist, such as pyssh. At worst, you can script an SSH client using a tool like the third-party pyexpect.
urllib2
An enhanced version of the urllib module that adds specialized classes for a variety of protocols. The main focus of urllib2 is the handling of authentication and encryption methods.
SEE ALSO: urllib 388
webbrowser
Remote-control interfaces to some browsers.

5.3.2 Third-Party Internet Related Tools
There are many very fine Internet-related tools that this book cannot discuss, but to which no slight is intended. A good index to such tools is the relevant page at the Vaults of Parnassus:
<http://py.vaults.ca/apyllo.py/812237977>
Quixote
In brief, Quixote is a templating system for HTML delivery. More so than systems like PHP, ASP, and, to an extent, JSP, Quixote emphasizes Web application structure over page appearance. The home page for Quixote is <http://www.mems-exchange.org/software/quixote/>
Twisted
To describe Twisted, it is probably best simply to quote from Twisted Matrix Laboratories' Web site <http://www.twistedmatrix.com/>:

"Twisted is a framework, written in Python, for writing networked applications. It includes implementations of a number of commonly used network services such as a Web server, an IRC chat server, a mail server, a relational database interface and an object broker. Developers can build applications using all of these services as well as custom services that they write themselves. Twisted also includes a user authentication system that controls access to services and provides services with user context information to implement their own security models."

While Twisted overlaps significantly in purpose with Zope, Twisted is generally lower-level and more modular (which has both pros and cons). Some protocols supported by Twisted (usually as both server and client, and implemented in pure Python) are SSH; FTP; HTTP; NNTP; SOCKSv4; SMTP; IRC; Telnet; POP3; AOL's instant messaging TOC; OSCAR, used by AOL-IM as well as ICQ; DNS; MouseMan; finger; Echo, discard, chargen, and friends; Twisted Perspective Broker, a remote object protocol; and XML-RPC.
Zope
Zope is a sophisticated, powerful, and just plain complicated Web application server. It incorporates everything from dynamic page generation, to database interfaces, to Web-based administration, to back-end scripting in several styles and languages. While the learning curve is steep, experienced Zope developers can develop and manage Web applications more easily, reliably, and faster than users of pretty much any other technology. The home page for Zope is <http://zope.org/>.