The robotparser Module
(New in 2.0) The robotparser module reads robots.txt files, which are used to implement the Robot Exclusion Protocol (http://info.webcrawler.com/mak/projects/robots/robots.html).
If you're implementing an HTTP robot that will visit arbitrary sites on the Net (not just your own sites), it's a good idea to use this module to check that you really are welcome. Example 7-21 demonstrates the robotparser module.
Example 7-21. Using the robotparser Module
File: robotparser-example-1.py
import robotparser

r = robotparser.RobotFileParser()

# fetch and parse the site's robots.txt file
r.set_url("http://www.python.org/robots.txt")
r.read()

if r.can_fetch("*", "/index.html"):
    print "may fetch the home page"

if r.can_fetch("*", "/tim_one/index.html"):
    print "may fetch the tim peters archive"
may fetch the home page
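If you already have the contents of a robots.txt file (or want to test rules without a network round trip), you can feed lines directly to the parser with the parse method instead of calling read. The following sketch uses Python 3, where the module was moved to urllib.robotparser; the rule set shown here is made up for illustration.

```python
# Parse robots.txt rules from a string instead of fetching over the network.
# Note: in Python 3, robotparser lives in the urllib package.
from urllib.robotparser import RobotFileParser

# hypothetical rule set, for illustration only
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# can_fetch takes a user agent string and a URL (or path)
print(rp.can_fetch("*", "/index.html"))        # allowed
print(rp.can_fetch("*", "/private/page.html")) # disallowed
```

Because parse works on any sequence of lines, this is also a convenient way to unit-test a robot against a site's exclusion rules before letting it loose.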