The robotparser Module

(New in 2.0) The robotparser module reads robots.txt files, which are used to implement the Robot Exclusion Protocol (http://info.webcrawler.com/mak/projects/robots/robots.html).

If you're implementing an HTTP robot that will visit arbitrary sites on the Net (not just your own sites), it's a good idea to use this module to check that you really are welcome. Example 7-21 demonstrates the robotparser module.

Example 7-21. Using the robotparser Module

File: robotparser-example-1.py

import robotparser

r = robotparser.RobotFileParser()
r.set_url("http://www.python.org/robots.txt")
r.read()

if r.can_fetch("*", "/index.html"):
    print "may fetch the home page"

if r.can_fetch("*", "/tim_one/index.html"):
    print "may fetch the tim peters archive"

may fetch the home page
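
If you already have the robots.txt text in hand, or want to test exclusion rules offline, you can pass its lines directly to the parser's parse method instead of fetching them over the network with read. A minimal sketch, using a made-up robots.txt; the try/except import is an extra nicety so the same code also runs on Python 3, where the module moved to urllib.robotparser:

```python
try:
    import robotparser              # Python 2
except ImportError:
    from urllib import robotparser  # Python 3

# A hypothetical robots.txt, supplied as a string for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

r = robotparser.RobotFileParser()
r.parse(robots_txt.splitlines())

print(r.can_fetch("*", "/index.html"))    # allowed
print(r.can_fetch("*", "/private/data"))  # disallowed
```

This keeps the example self-contained and makes the rules easy to unit test, since no HTTP request is involved.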
