A News-to-Mail Gateway

The last code example of this chapter is a custom news-to-mail gateway. It periodically scans Netnews for articles of interest, bundles them into a MIME message, and sends them via Internet mail. The script keeps track of the messages it has previously sent, so each run sends only articles that haven't been seen before.

You control the script's scope by specifying a list of newsgroups and, optionally, one or more patterns to search for in the subject lines of the articles contained in the newsgroups. If you don't specify any subject-line patterns, the script fetches the entire contents of the listed newsgroups. The subject-line patterns take advantage of Perl's pattern-matching engine, and can be any regular expression. For performance reasons, however, we use the built-in NNTP wildcard patterns for newsgroup names.

The following command searches the comp.lang.perl.* newsgroups for articles that have the word "Socket" or "socket" in the subject line. Matching articles will be mailed to the local e-mail address lstein. Options include -subject, to specify the subject pattern match (the example abbreviates it to -subj, which Getopt::Long accepts by default as a unique abbreviation), -mail, to set the mail recipient(s), and -v, to turn on verbose progress messages.

    % scan_newsgroups.pl -v -mail lstein -subj '[sS]ocket' 'comp.lang.perl.*'
    Searching comp.lang.perl.misc for matches
    Fetching overview for comp.lang.perl.misc
    found 39 matching articles
    Searching comp.lang.perl.announce for matches
    Fetching overview for comp.lang.perl.announce
    found 0 matching articles
    Searching comp.lang.perl.tk for matches
    Fetching overview for comp.lang.perl.tk
    found 1 matching articles
    Searching comp.lang.perl.modules for matches
    Fetching overview for comp.lang.perl.modules
    found 4 matching articles
    44 articles, 40 unseen
    sending e-mail message to lstein

The received e-mail message contains a brief prologue that describes the search and newsgroup patterns, followed by the matching articles.
Each article is attached as an enclosure of MIME type message/rfc822. Depending on the reader's mail-reading software, the enclosures are displayed as either in-line components of the message or attachments. The result is particularly nice in the Netscape mail reader (Figure 8.8) because each article is displayed using fancy fonts and hyperlinks.

Figure 8.8. E-mail message sent from scan_newsgroups.pl

Figure 8.9 lists the code for scan_newsgroups.pl.

Figure 8.9. The scan_newsgroups.pl script

Lines 1–7: Load modules
We load the Net::NNTP and MIME::Entity modules, as well as the Getopt::Long module for argument processing. We need to keep track of all the messages that we have found during previous runs of the script, and the easiest way to do that is to keep the message IDs in an indexed DBM database. However, we don't know a priori which DBM library is available, so we import the AnyDBM_File module, which chooses a library for us. The code contained in the BEGIN{} block changes the DBM library search order, as described in the AnyDBM_File documentation. We also load the Fcntl module in order to have access to several constants needed to initialize the DBM file.

Lines 9–22: Define constants
We choose a name for the DBM file, a file named .newscache in the user's home directory, and create a usage message.

Lines 23–25: Declare globals
The first line of globals corresponds to the command-line options. The second line holds the data structures manipulated by the script. The %Seen hash will be tied to the DBM file. Its keys are the message IDs of articles that we have previously retrieved. %Articles contains information about the articles recovered during the current search. Its keys are message IDs, and its values are hash references of header fields derived from the overview index. Last, @Fields contains the list of header fields returned by the xover() method.
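
The module-loading idiom and the tied hash it supports can be tried in isolation. The following is a minimal sketch, not the code from Figure 8.9: the search order placed in @AnyDBM_File::ISA, the temporary cache path, the helper names remember() and was_seen(), and the sample message ID are all assumptions made for the demonstration.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);    # flag constants passed to tie()
use File::Temp qw(tempdir);

# Assumed search order; AnyDBM_File loads the first listed library
# that is actually installed.
BEGIN { @AnyDBM_File::ISA = qw(DB_File NDBM_File SDBM_File) }
use AnyDBM_File;

# A throwaway directory stands in for the user's home directory.
my $cache = tempdir(CLEANUP => 1) . '/newscache';

# Open the DBM file (creating it if necessary), record one ID, close it.
sub remember {
    my ($path, $id) = @_;
    tie(my %seen, 'AnyDBM_File', $path, O_RDWR|O_CREAT, 0640)
        or die "Can't tie $path: $!";
    $seen{$id} = 1;
    untie %seen;
}

# Reopen the DBM file and report whether the ID was recorded earlier.
sub was_seen {
    my ($path, $id) = @_;
    tie(my %seen, 'AnyDBM_File', $path, O_RDWR|O_CREAT, 0640)
        or die "Can't tie $path: $!";
    my $found = exists $seen{$id} ? 1 : 0;
    untie %seen;
    return $found;
}

remember($cache, '<hypothetical-1@example.com>');
print was_seen($cache, '<hypothetical-1@example.com>') ? "seen\n" : "new\n";
```

Because the hash is tied to a file, an ID recorded in one tie() session is still present in the next, which is what lets the script remember articles between runs.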

Lines 26–34: Process command-line arguments
We call GetOptions() to process the command-line options, and then check the consistency of the arguments. If the e-mail recipient isn't explicitly given on the command line, we default to the user's login name.

Lines 35–36: Open connection to Netnews server
We open a connection to the Netnews server by calling Net::NNTP->new(). If the server isn't explicitly given on the command line, the $SERVER option is undefined and Net::NNTP picks a suitable default.

Lines 37–39: Open DBM file
We tie %Seen to the .newscache file using the AnyDBM_File module. The options passed to tie() cause the file to be opened read/write and, if it doesn't already exist, to be created with file mode 0640 (-rw-r-----).

Lines 40–41: Compile the pattern match
For efficiency's sake, we compile the pattern matches into an anonymous subroutine. This subroutine takes the text of a subject line and returns true if all the patterns match, and false otherwise. The match_code() subroutine takes the list of pattern matches, compiles them, and returns an appropriate code reference.

Lines 42–43: Expand newsgroup patterns
We pass the list of newsgroups to a subroutine named expand_newsgroups(). It calls the NNTP server to expand the wildcards in the list of newsgroups and returns the expanded list of newsgroup names.

Lines 44–45: Search for matching articles
We loop through the expanded list of newsgroups and call grep_group() for each one. The arguments to grep_group() consist of the newsgroup name and a code reference used to filter its articles. Internally, grep_group() accumulates the matched articles' message IDs into the %Articles hash. We do it this way because the same article may be cross-posted to several related newsgroups; using the message IDs as hash keys avoids accumulating duplicates.

Lines 46–48: Filter out articles already seen
We use Perl's grep() function to filter out articles whose message IDs are already present in the tied %Seen hash.
New article IDs are added to the hash so that on subsequent runs we will know that we've seen them. The unseen article IDs are assigned to the @to_fetch array. If the user ran the script with the -all option, we short-circuit the grep() operation so that all articles are retrieved, including those we've seen before. This does not affect the updating of the tied %Seen hash.

Lines 49–52: Add articles to an outgoing mail message and quit
We pass the list of article IDs to send_mail(), which retrieves their contents and adds them to an outgoing mail message. We then call the NNTP object's quit() method to disconnect from the server, and exit.

Lines 53–62: The match_code() subroutine
The match_code() subroutine takes a list of zero or more patterns and constructs a code reference on the fly. The subroutine is built up line by line in a scalar variable called $code. The subroutine is designed to return true only if all the patterns match the passed subject line. If no patterns are specified, the subroutine returns true by default. If the -insensitive option was passed to the script, we do case-insensitive pattern matches with the i flag. Otherwise, we do case-sensitive matches. After constructing the subroutine code, we eval() it and return the result to the caller. If the eval() fails (presumably because of an error in one or more of the regular expressions), we propagate the error message and die.

Lines 63–73: The expand_newsgroups() subroutine
The expand_newsgroups() subroutine takes a list of newsgroup patterns and calls the NNTP object's newsgroups() method on each of them in turn, expanding them to a list of valid newsgroup names. If a newsgroup contains no wildcards, we just pass it back unchanged.

Lines 74–85: The grep_group() subroutine
grep_group() scans the specified newsgroup for articles whose subject lines match a set of patterns. The patterns are provided in the form of a code reference that returns true if the subject line matches.
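
The code reference handed to grep_group() is the one built by match_code(). As a rough illustration of that build-then-eval() technique, here is a sketch reconstructed from the description above rather than copied from Figure 8.9. The argument order (a case-insensitivity flag followed by the patterns), the variable names, and the choice of / as the regular expression delimiter are assumptions.

```perl
use strict;
use warnings;

# Build an anonymous subroutine that returns true only when every
# pattern matches its argument. Hypothetical signature: the first
# argument is a boolean "ignore case" flag, the rest are patterns.
sub match_code {
    my ($insensitive, @patterns) = @_;
    my $flags = $insensitive ? 'i' : '';

    # Assemble the source text of the subroutine one line at a time.
    my $code = "sub {\n    my \$subject = shift;\n";
    $code .= "    return 0 unless \$subject =~ /$_/$flags;\n" for @patterns;
    $code .= "    return 1;\n}\n";    # no patterns: always true

    # Compile the text; a malformed pattern surfaces here as $@.
    my $sub = eval $code;
    die "Can't compile pattern match: $@" if $@;
    return $sub;
}

my $matcher = match_code(0, '[sS]ocket');
print "match\n" if $matcher->('Problem with IO::Socket timeouts');
```

Because each pattern becomes its own early-return test, the generated subroutine stops at the first pattern that fails. A pattern that itself contains an unescaped / would not compile in this sketch and would trigger the die.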
We call the get_overview() subroutine to return the server's overview index for the newsgroup. get_overview() returns a hash reference in which each key is a message number and each value is a hash of indexed header fields. We step through each message, recover its Subject: and Message-ID: fields, and pass the subject field to the pattern-matching code reference. If the code reference returns false, we go on to the next article. Otherwise, we add the article's message ID and overview data to the %Articles global. When all articles have been examined, we return the number of matching articles to the caller.

Lines 89–102: The get_overview() subroutine
The get_overview() subroutine used here is a slight improvement over the version shown earlier. We start by calling the NNTP object's group() method, recovering the newsgroup's first and last message numbers. We then call the object's overview_fmt() method to retrieve the names of the fields in the overview index. Since this information isn't going to change during the lifetime of the script, however, we cache it in the @Fields global and call overview_fmt() only if the global is empty. Before assigning to @Fields, we clean up the field names by removing the ":" and anything following it.

We recover the overview for the entire newsgroup by calling the xover() method for the range spanning the first and last article numbers. We now loop through the keys of the returned overview hash, replacing its array reference values, which list fields by position, with anonymous hashes that list fields by name. In addition to recording the header fields that occur in the article itself, we record a pseudofield named Message-Number: that contains the group name and message number in the form group.name:number. We use this information during e-mail construction to create the default name for the article enclosure.
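
The two data transformations in get_overview() can be shown without a live news server. In the sketch below, which is illustrative only, the helper names clean_fields() and name_overview() are hypothetical, the sample field list and overview record are invented, and storing the pseudofield under the key 'Message-Number' (without a trailing colon) is an assumption. The input shapes mimic what overview_fmt() and xover() return: a list of field names, and a hash of message numbers pointing to arrays of field values.

```perl
use strict;
use warnings;

# Strip the ":" and anything after it, so "Subject:" becomes "Subject"
# and "Xref:full" becomes "Xref".
sub clean_fields {
    my @fields = @_;
    s/:.*$// for @fields;
    return @fields;
}

# Replace each positional array in the overview with a hash keyed by
# field name, and add the group:number pseudofield.
sub name_overview {
    my ($group, $fields, $overview) = @_;
    for my $number (keys %$overview) {
        my %by_name;
        @by_name{@$fields} = @{ $overview->{$number} };    # hash slice
        $by_name{'Message-Number'} = "$group:$number";
        $overview->{$number} = \%by_name;
    }
    return $overview;
}

# Invented sample data in the shape the server would provide.
my @fields   = clean_fields('Subject:', 'From:', 'Message-ID:', 'Xref:full');
my $overview = name_overview(
    'comp.lang.perl.misc',
    \@fields,
    { 101 => ['Socket timeouts', 'someone@example.com',
              '<hypothetical-1@example.com>', 'news.example.com misc:101'] },
);
print "$overview->{101}{Subject} ($overview->{101}{'Message-Number'})\n";
```

The hash slice pairs field names with values by position, so the order of the cached field names has to match the order in which the server reports the overview columns.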

Lines 103–124: The send_mail() subroutine
send_mail() is called with an array of article IDs to fetch, and is responsible for constructing a multipart MIME message containing each article as an attachment. We create a short message prologue that summarizes the program's run-time options and create a new MIME::Entity by calling the build() method. The message starts as a single-part message of type text/plain, but is automatically promoted to a multipart message as soon as we start attaching articles to it.

We then call attach_article() for each article listed in $to_fetch. This array may be empty, in which case we make no attachments. When all articles have been attached, we call the MIME entity's smtpsend() method to send out the mail using the Mail::Mailer SMTP method, and clean up any temporary files by calling the entity's purge() method.

Lines 125–134: The attach_article() subroutine
For the indicated message ID we fetch the entire article's contents as an array of lines by calling the NNTP object's article() method. We then attach the article to the outgoing mail message, specifying a MIME type of message/rfc822, a description corresponding to the article's subject line, and a suggested filename derived from the article's newsgroup and message number (taken from the global %Articles hash).

An interesting feature of this script is the fact that because we are storing unique global message IDs in the .newscache hashed database, we can switch to a different NNTP server without worrying about retrieving articles we have already seen.
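
The server independence noted above follows from keying the "seen" filter on message IDs alone. The sketch below illustrates the grep()-based filter described for lines 46–48; it is an assumption-laden stand-in rather than the script's own code. The function name unseen(), its argument order, and the sample IDs are hypothetical, and a plain hash takes the place of the tied %Seen hash.

```perl
use strict;
use warnings;

# Return the IDs that have not been seen before, recording every ID
# in the hash as a side effect. With $all true, every ID is returned,
# but the hash is still updated.
sub unseen {
    my ($seen, $all, @ids) = @_;
    # The post-increment yields the old count: false (undef or 0)
    # only the first time a given ID is encountered.
    return grep { !$seen->{$_}++ || $all } @ids;
}

my %seen;    # stand-in for the tied DBM hash

# First run, against one server.
my @first = unseen(\%seen, 0, '<a@example.com>', '<b@example.com>');

# Later run, perhaps against a different server carrying an overlapping
# set of articles: only the ID not yet recorded comes back.
my @second = unseen(\%seen, 0, '<b@example.com>', '<c@example.com>');

print scalar(@first), " then ", scalar(@second), " unseen\n";
```

Since the increment is evaluated before the test of $all, the hash is updated even when the filter is bypassed, matching the behavior described for the -all option.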