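`urllib.robotparser` --- Parser for robots.txt

Source code: Lib/urllib/robotparser.py

This module provides a single class, RobotFileParser, which answers questions about whether or not a particular user agent can fetch a URL on the web site that published the robots.txt file. For more details on the structure of robots.txt files, see http://www.robotstxt.org/orig.html.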

class urllib.robotparser.RobotFileParser(url='')

This class provides methods to read, parse and answer questions about the robots.txt file at url.

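set_url(url): Sets the URL referring to a robots.txt file.

read(): Reads the robots.txt URL and feeds it to the parser.

parse(lines): Parses the lines argument, an iterable of the lines of a robots.txt file.

can_fetch(useragent, url): Returns True if the useragent is allowed to fetch the url according to the rules contained in the parsed robots.txt file.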
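The interplay of parse() and can_fetch() can be tried without any network access by feeding the parser an in-memory robots.txt. The following is a minimal sketch; the rules, the agent names ExampleBot and OtherBot, and the example.com URLs are made up for illustration.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /cgi-bin/
"""

rp = urllib.robotparser.RobotFileParser()
# parse() takes the file as a sequence of lines, not as one string.
rp.parse(ROBOTS_TXT.splitlines())

# ExampleBot is matched by its own entry.
specific_private = rp.can_fetch("ExampleBot", "http://example.com/private/data.html")
specific_cgi = rp.can_fetch("ExampleBot", "http://example.com/cgi-bin/search")

# OtherBot has no entry of its own, so the "*" entry applies.
other_cgi = rp.can_fetch("OtherBot", "http://example.com/cgi-bin/search")
other_home = rp.can_fetch("OtherBot", "http://example.com/")

print(specific_private, specific_cgi, other_cgi, other_home)
```

In this sketch ExampleBot is still allowed under /cgi-bin/, because the parser uses the first entry that matches the user agent and falls back to the `*` entry only when no specific entry matches.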

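mtime(): Returns the time the robots.txt file was last fetched. This is useful for long-running web spiders that need to check for new robots.txt files periodically.

modified(): Sets the time the robots.txt file was last fetched to the current time.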
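One way a long-running spider might use mtime() and modified() is to decide when its cached rules are due for a refresh. The sketch below is only an illustration: the helper is_stale and the one-hour MAX_AGE_SECONDS interval are assumptions, not part of the module, and the rules are parsed from memory rather than fetched.

```python
import time
import urllib.robotparser

MAX_AGE_SECONDS = 3600  # hypothetical refresh interval chosen for this sketch


def is_stale(parser, max_age=MAX_AGE_SECONDS, now=None):
    """Hypothetical helper: True if rules were never loaded or are older than max_age."""
    now = time.time() if now is None else now
    last = parser.mtime()  # 0 until a fetch time has been recorded
    return last == 0 or now - last > max_age


rp = urllib.robotparser.RobotFileParser()
stale_before = is_stale(rp)  # nothing has been loaded yet

rp.parse(["User-agent: *", "Disallow: /tmp/"])
rp.modified()  # explicitly record "now" as the last fetch time
stale_after = is_stale(rp)

# Simulate a check made two refresh intervals later.
stale_later = is_stale(rp, now=rp.mtime() + 2 * MAX_AGE_SECONDS)

print(stale_before, stale_after, stale_later)
```

A real spider would call read() again whenever the helper reports stale rules.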

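crawl_delay(useragent): Returns the value of the Crawl-delay parameter from robots.txt for the useragent in question. If there is no such parameter, or it does not apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.6.

request_rate(useragent): Returns the contents of the Request-rate parameter from robots.txt as a named tuple RequestRate(requests, seconds). If there is no such parameter, or it does not apply to the useragent specified, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.6.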
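A crawler could combine crawl_delay() and request_rate() into a single pause between requests. The following is a rough sketch under stated assumptions: the robots.txt lines are invented, and min_interval is a hypothetical helper that simply takes the stricter of the two limits, which is one possible policy rather than anything the module prescribes.

```python
import urllib.robotparser

# Hypothetical rules, parsed from memory so no network access is needed.
lines = [
    "User-agent: *",
    "Crawl-delay: 6",
    "Request-rate: 3/20",
    "Disallow: /cgi-bin/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)

delay = rp.crawl_delay("*")   # int, or None if absent or invalid
rate = rp.request_rate("*")   # RequestRate(requests, seconds), or None


def min_interval(delay, rate):
    """Hypothetical helper: seconds to wait between requests (0.0 if unrestricted)."""
    candidates = [0.0]
    if delay is not None:
        candidates.append(float(delay))
    if rate is not None and rate.requests > 0:
        candidates.append(rate.seconds / rate.requests)
    return max(candidates)


interval = min_interval(delay, rate)
print(delay, rate.requests, rate.seconds, interval)

# An entry with invalid syntax yields None instead of raising.
bad = urllib.robotparser.RobotFileParser()
bad.parse(["User-agent: *", "Crawl-delay: soon", "Disallow: /tmp/"])
bad_delay = bad.crawl_delay("*")
```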

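site_maps(): Returns the contents of the Sitemap parameter from robots.txt as a list. If there is no such parameter, or the robots.txt entry for this parameter has invalid syntax, returns None.

New in version 3.8.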
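Because site_maps() returns None rather than an empty list when no sitemap is declared, callers usually need to guard for that case. Here is a minimal sketch, assuming Python 3.8 or later; the robots.txt lines and example.com URLs are made up.

```python
import urllib.robotparser

# Hypothetical robots.txt with two Sitemap lines.
with_maps = urllib.robotparser.RobotFileParser()
with_maps.parse([
    "Sitemap: http://example.com/sitemap.xml",
    "User-agent: *",
    "Disallow: /private/",
    "Sitemap: http://example.com/news-sitemap.xml",
])

# Hypothetical robots.txt without any Sitemap line.
without_maps = urllib.robotparser.RobotFileParser()
without_maps.parse(["User-agent: *", "Disallow: /private/"])

found = with_maps.site_maps()
missing = without_maps.site_maps()

# Normalise None to an empty list before iterating.
for sitemap_url in found or []:
    print(sitemap_url)
for sitemap_url in missing or []:
    print(sitemap_url)  # never reached in this sketch
```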

The following example demonstrates basic use of the RobotFileParser class:

>>> import urllib.robotparser
>>> rp = urllib.robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rrate = rp.request_rate("*")
>>> rrate.requests
3
>>> rrate.seconds
20
>>> rp.crawl_delay("*")
6
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
False
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
True

© Copyright 2001-2021, Python Software Foundation.
Last updated on Oct 23, 2021.