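HOWTO Fetch Internet Resources Using The urllib Package¶

Author: Michael Foord

Note

There is a French translation of an earlier revision of this HOWTO, available at urllib2 - Le Manuel manquant.

Introduction¶

urllib.request is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function, which can fetch URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations such as basic authentication, cookies and proxies. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string before the ":" in the URL - for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.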
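As a small illustration of what a "URL scheme" is, the sketch below uses urllib.parse.urlsplit to extract the scheme from a few URLs. Apart from "ftp://python.org/", the URLs are placeholders chosen for this example, and nothing is fetched.

```python
from urllib.parse import urlsplit

# The scheme is the part of the URL before the first ":".
urls = [
    "ftp://python.org/",
    "http://python.org/",
    "file:///tmp/example.txt",  # placeholder path, never opened
]
schemes = [urlsplit(url).scheme for url in urls]
print(schemes)
```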

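For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and is not intended to be easy to read. This HOWTO aims to illustrate using urllib, with enough detail about HTTP to help you through. It is not intended to replace the urllib.request docs, but is supplementary to them.

Fetching URLs¶

The simplest way to use urllib.request is as follows: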

import urllib.request
with urllib.request.urlopen('http://python.org/') as response:
   html = response.read()

If you wish to retrieve a resource via a URL and store it in a temporary location, you can do so with the shutil.copyfileobj() and tempfile.NamedTemporaryFile() functions:

import shutil
import tempfile
import urllib.request

with urllib.request.urlopen('http://python.org/') as response:
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        shutil.copyfileobj(response, tmp_file)

with open(tmp_file.name) as html:
    pass

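Many uses of urllib will be that simple (note that instead of an "http:" URL we could have used a URL starting with "ftp:", "file:", etc.). However, the purpose of this tutorial is to explain the more complicated cases, concentrating on HTTP.

HTTP is based on requests and responses - the client makes requests and servers send responses. urllib.request mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can, for example, call .read() on the response: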

import urllib.request

req = urllib.request.Request('http://www.voidspace.org.uk')
with urllib.request.urlopen(req) as response:
   the_page = response.read()
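A Request object can be inspected before it is sent. The following is a minimal sketch, assuming the same URL as above; none of these calls touch the network.

```python
import urllib.request

req = urllib.request.Request('http://www.voidspace.org.uk')

# Creating a Request only records what will be asked for;
# nothing is sent until the request is passed to urlopen.
print(req.full_url)      # the URL the request was built from
print(req.type)          # the URL scheme
print(req.host)          # the host part of the URL
print(req.get_method())  # 'GET', because no data was supplied
```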

Note that urllib.request makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so:

req = urllib.request.Request('ftp://example.com/')

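In the case of HTTP, there are two extra things that Request objects allow you to do. First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.

Data¶

Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib.parse library: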

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name' : 'Michael Foord',
          'location' : 'Northampton',
          'language' : 'Python' }

data = urllib.parse.urlencode(values)
data = data.encode('ascii') # data should be bytes
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as response:
   the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see HTML Specification, Form Submission for more details).
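The effect of the data argument can be checked without sending anything. This sketch, which reuses the placeholder server URL from the example above, compares a Request built with and without data.

```python
import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord', 'language': 'Python'}

# urlencode produces a str; Request needs bytes for the body.
data = urllib.parse.urlencode(values).encode('ascii')

get_req = urllib.request.Request(url)         # no data -> GET
post_req = urllib.request.Request(url, data)  # data present -> POST

print(get_req.get_method())
print(post_req.get_method())
print(post_req.data)
```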

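If you do not pass the data argument, urllib uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.

This is done as follows: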

>>> import urllib.request
>>> import urllib.parse
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.parse.urlencode(data)
>>> print(url_values)  # The order may differ from below.  
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.
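To see what the encoding does to characters that have a special meaning in URLs, the sketch below builds a query string and then decodes it again with urllib.parse.parse_qs. The parameter names and values are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# 'q' contains '&' and '=', which must be escaped inside a query string.
params = {'name': 'Somebody Here', 'q': 'a&b=c'}
full_url = 'http://www.example.com/example.cgi' + '?' + urlencode(params)
print(full_url)

# Decoding recovers the original values (as lists, because a key
# may legally appear more than once in a query string).
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```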

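Headers¶

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites [1] dislike being browsed by programs, or send different versions to different browsers [2]. By default urllib identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [3]. When you create a Request object you can pass in a dictionary of headers. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [4]: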

import urllib.parse
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
values = {'name': 'Michael Foord',
          'location': 'Northampton',
          'language': 'Python' }
headers = {'User-Agent': user_agent}

data = urllib.parse.urlencode(values)
data = data.encode('ascii')
req = urllib.request.Request(url, data, headers)
with urllib.request.urlopen(req) as response:
   the_page = response.read()
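Headers attached to a Request can be inspected before the request is sent. The sketch below is illustrative only: the Accept-Language header is an addition made for this example. One detail to be aware of is that Request normalises header names with str.capitalize(), so a header passed as 'User-Agent' is looked up as 'User-agent'.

```python
import urllib.request

url = 'http://www.someserver.com/cgi-bin/register.cgi'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
req = urllib.request.Request(url, headers={'User-Agent': user_agent})

# Headers can also be added after the Request has been created.
req.add_header('Accept-Language', 'en')

# Names are stored capitalised ('User-agent'), so use that form here.
print(req.get_header('User-agent'))
print(req.has_header('Accept-language'))
print(req.header_items())
```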

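The response also has two useful methods. See the section on info and geturl, which comes after we have a look at what happens when things go wrong.

Handling Exceptions¶

urlopen raises URLError when it cannot handle a response (though, as usual with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

The exception classes are exported from the urllib.error module.

URLError¶

Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a "reason" attribute, which is a tuple containing an error code and a text error message.

e.g.: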

>>> req = urllib.request.Request('http://www.pretend_server.org')
>>> try: urllib.request.urlopen(req)
... except urllib.error.URLError as e:
...     print(e.reason)      
...
(4, 'getaddrinfo failed')
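URLError is not specific to network failures. As a sketch that runs without a network connection, the example below tries to open a file: URL whose path is assumed not to exist. The exact type and text of e.reason vary by situation and platform, so the code only prints it.

```python
import urllib.error
import urllib.request

# Placeholder path that is assumed not to exist on this machine.
missing = 'file:///no/such/directory/no-such-file.txt'

try:
    urllib.request.urlopen(missing)
except urllib.error.URLError as e:
    caught = e
    # For file: URLs the reason is typically the underlying OSError
    # rather than a (code, message) tuple.
    print(type(e.reason).__name__, e.reason)
else:
    caught = None
```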

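HTTPError¶

Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised will have an integer "code" attribute, which corresponds to the error sent by the server.

Error Codes¶

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

http.server.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience: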

# Table mapping response codes to messages; entries have the
# form {code: (shortmessage, longmessage)}.
responses = {
    100: ('Continue', 'Request received, please continue'),
    101: ('Switching Protocols',
          'Switching to new protocol; obey Upgrade header'),

    200: ('OK', 'Request fulfilled, document follows'),
    201: ('Created', 'Document created, URL follows'),
    202: ('Accepted',
          'Request accepted, processing continues off-line'),
    203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
    204: ('No Content', 'Request fulfilled, nothing follows'),
    205: ('Reset Content', 'Clear input form for further input.'),
    206: ('Partial Content', 'Partial content follows.'),

    300: ('Multiple Choices',
          'Object has several resources -- see URI list'),
    301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
    302: ('Found', 'Object moved temporarily -- see URI list'),
    303: ('See Other', 'Object moved -- see Method and URL list'),
    304: ('Not Modified',
          'Document has not changed since given time'),
    305: ('Use Proxy',
          'You must use proxy specified in Location to access this '
          'resource.'),
    307: ('Temporary Redirect',
          'Object moved temporarily -- see URI list'),

    400: ('Bad Request',
          'Bad request syntax or unsupported method'),
    401: ('Unauthorized',
          'No permission -- see authorization schemes'),
    402: ('Payment Required',
          'No payment -- see charging schemes'),
    403: ('Forbidden',
          'Request forbidden -- authorization will not help'),
    404: ('Not Found', 'Nothing matches the given URI'),
    405: ('Method Not Allowed',
          'Specified method is invalid for this server.'),
    406: ('Not Acceptable', 'URI not available in preferred format.'),
    407: ('Proxy Authentication Required', 'You must authenticate with '
          'this proxy before proceeding.'),
    408: ('Request Timeout', 'Request timed out; try again later.'),
    409: ('Conflict', 'Request conflict.'),
    410: ('Gone',
          'URI no longer exists and has been permanently removed.'),
    411: ('Length Required', 'Client must specify Content-Length.'),
    412: ('Precondition Failed', 'Precondition in headers is false.'),
    413: ('Request Entity Too Large', 'Entity is too large.'),
    414: ('Request-URI Too Long', 'URI is too long.'),
    415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
    416: ('Requested Range Not Satisfiable',
          'Cannot satisfy request range.'),
    417: ('Expectation Failed',
          'Expect condition could not be satisfied.'),

    500: ('Internal Server Error', 'Server got itself in trouble'),
    501: ('Not Implemented',
          'Server does not support this operation'),
    502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
    503: ('Service Unavailable',
          'The server cannot process the request due to a high load'),
    504: ('Gateway Timeout',
          'The gateway server did not receive a timely response'),
    505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
    }
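The dictionary can also be queried directly at run time. The helper names describe and is_error below are hypothetical, introduced only for this sketch, and the long messages may be worded differently from the listing above depending on the Python version.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler


def describe(code):
    """Return 'code short-message' for a known status code."""
    short, _long = BaseHTTPRequestHandler.responses[code]
    return f"{code} {short}"


def is_error(code):
    """True for the 400-599 range that urlopen reports as HTTPError."""
    return 400 <= code <= 599


print(describe(404))
print(HTTPStatus(404).phrase)  # the same short message via http.HTTPStatus
print([code for code in (200, 302, 404, 503) if is_error(code)])
```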

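When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response for the page returned. This means that as well as the code attribute, it also has the read, geturl, and info methods, as returned by the urllib.response module: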

>>> req = urllib.request.Request('http://www.python.org/fish.html')
>>> try:
...     urllib.request.urlopen(req)
... except urllib.error.HTTPError as e:
...     print(e.code)
...     print(e.read())  
...
404
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
  ...
  <title>Page Not Found</title>\n
  ...
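To show the response-like behaviour without contacting a server, the sketch below constructs an HTTPError by hand. This is purely for illustration: in real code urlopen creates the exception, and the header and body values here are made up.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError

# Build an HTTPError manually so nothing is fetched; all values are invented.
headers = Message()
headers['Content-Type'] = 'text/html'
body = b'<html><title>Page Not Found</title></html>'
err = HTTPError('http://www.python.org/fish.html', 404, 'Not Found',
                headers, io.BytesIO(body))

try:
    raise err
except URLError as e:            # HTTPError is a subclass of URLError
    status = e.code              # integer status code
    page = e.read()              # file-like, just like a normal response
    content_type = e.info()['Content-Type']
    final_url = e.geturl()
    e.close()

print(status, content_type, final_url)
print(page)
```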

Wrapping it Up¶

So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.

Number 1¶

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    # everything is fine
    pass

Note

The except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
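The ordering matters because HTTPError is a subclass of URLError. The following sketch wraps the first approach in a helper; the name try_fetch and the file: URL are assumptions made for this example, and the path is assumed not to exist so that the code runs without a network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def try_fetch(url):
    """Return (body, problem); exactly one of the two is None."""
    try:
        with urlopen(url) as response:
            return response.read(), None
    except HTTPError as e:       # the subclass must be handled first
        return None, f'HTTP error {e.code}'
    except URLError as e:
        return None, f'failed to reach a server: {e.reason}'


print(issubclass(HTTPError, URLError))
body, problem = try_fetch('file:///no/such/directory/no-such-file.txt')
print(problem)
```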

Number 2¶

from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    # Check 'code' first: an HTTPError has both 'code' and 'reason'.
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    # everything is fine
    pass

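info and geturl¶

The response returned by urlopen (or the HTTPError instance) has two useful methods, info() and geturl(), and is defined in the module urllib.response.

geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an http.client.HTTPMessage instance.

Typical headers include "Content-length", "Content-type", and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.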
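Both methods can be tried without a network by opening a data: URL, which urllib decodes locally. This is only a sketch: for data: URLs the headers are synthesised by urllib, whereas for HTTP they are the headers the server actually sent.

```python
import urllib.request

# A data: URL carries its content inline, so nothing is downloaded.
with urllib.request.urlopen('data:text/plain,hello') as response:
    final_url = response.geturl()   # the URL actually opened
    headers = response.info()       # dictionary-like header object
    body = response.read()

# Header lookups are case-insensitive.
print(final_url)
print(headers['Content-type'], headers['Content-length'])
print(body)
```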

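Openers and Handlers¶

When you fetch a URL you use an opener (an instance of the perhaps confusingly named urllib.request.OpenerDirector). Normally we have been using the default opener - via urlopen - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.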
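As a minimal sketch of building an opener by hand, the example below registers a single handler, urllib.request.DataHandler, so the resulting opener understands data: URLs and nothing else. The choice of handler and the URL are assumptions made to keep the example free of network access.

```python
import urllib.request

# Start from an empty opener and add exactly one handler.
opener = urllib.request.OpenerDirector()
opener.add_handler(urllib.request.DataHandler())

# The opener dispatches on the URL scheme: 'data' -> DataHandler.data_open.
with opener.open('data:text/plain,opened%20by%20hand') as response:
    content = response.read()

print(content)
```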

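Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.

Other handlers you might want deal with proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.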
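The sketch below uses build_opener to create an opener with cookie support, one of the cases mentioned earlier. It inspects opener.handlers, which is an implementation detail rather than documented API, purely to show what build_opener assembled; no URL is fetched.

```python
import http.cookiejar
import urllib.request

cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener adds the default handlers plus the one passed in.
opener = urllib.request.build_opener(cookie_handler)

# Implementation detail, used here only to list the installed handlers.
handler_names = sorted(type(h).__name__ for h in opener.handlers)
print(handler_names)

# opener.open(url) would now fetch url with cookie support, without
# changing what the module-level urlopen function does.
```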

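Basic Authentication¶

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a "realm". The header looks like: WWW-Authenticate: SCHEME realm="REALM".

e.g.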

WWW-Authenticate: Basic realm="cPanel Users"

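The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is "basic authentication". In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.

The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match: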

# create a password manager
password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Add the username and password.
# If we knew the realm, we could use it instead of None.
top_level_url = "http://example.com/foo/"
password_mgr.add_password(None, top_level_url, username, password)

handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

# create "opener" (OpenerDirector instance)
opener = urllib.request.build_opener(handler)

# use the opener to fetch a URL
opener.open(a_url)

# Install the opener.
# Now all calls to urllib.request.urlopen use our opener.
urllib.request.install_opener(opener)

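Note

In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations - ProxyHandler (if a proxy setting such as an http_proxy environment variable is set), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, DataHandler, HTTPErrorProcessor.

top_level_url is in fact either a full URL (including the "http:" scheme component and the hostname and optionally the port number), e.g. "http://example.com/", or an "authority" (i.e. the hostname, optionally including the port number), e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "joe:password@example.com" is not correct.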
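The matching rules can be checked directly on the password manager with its find_user_password method, without making any request. In this sketch the credentials 'joe' and 'secret' and the realm name are placeholders.

```python
import urllib.request

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

# Placeholder credentials, registered for any realm (None).
password_mgr.add_password(None, 'http://example.com/foo/', 'joe', 'secret')

# A URL "deeper" than the registered one matches, whatever the realm.
deeper = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/foo/bar/page.html')

# A URL outside the registered prefix does not match.
elsewhere = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/other/')

print(deeper)
print(elsewhere)
```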

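Proxies¶

urllib will auto-detect your proxy settings and use those. This is done through the ProxyHandler, which is part of the normal handler chain when a proxy setting is detected. Normally that's a good thing, but there are occasions when it may not be helpful [5]. One way to bypass the proxy is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler: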

>>> proxy_support = urllib.request.ProxyHandler({})
>>> opener = urllib.request.build_opener(proxy_support)
>>> urllib.request.install_opener(opener)
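To see which proxy settings urllib has detected, and to build openers that either bypass or explicitly use a proxy, something like the following sketch can be used. The proxy address is a placeholder, and nothing is fetched.

```python
import urllib.request

# Proxy settings urllib detected from the environment (possibly empty).
detected = urllib.request.getproxies()
print(detected)

# An opener that bypasses any detected proxy. Using it directly via
# no_proxy_opener.open(url) avoids changing the global default opener.
no_proxy_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# The opposite case: route http: requests through an explicit proxy.
# 'proxy.example.com:3128' is a placeholder address.
explicit = urllib.request.ProxyHandler({'http': 'http://proxy.example.com:3128'})
proxy_opener = urllib.request.build_opener(explicit)
```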

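Note

Currently urllib.request does not support fetching of https locations through a proxy. However, this can be enabled by extending urllib.request as shown in the recipe [6].

Note

HTTP_PROXY will be ignored if a variable REQUEST_METHOD is set; see the documentation on getproxies().

Sockets and Layers¶

The Python support for fetching resources from the web is layered. urllib uses the http.client library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. When this HOWTO was written, the socket timeout was not exposed at the http.client or urllib.request levels; in current Python versions urlopen also accepts a timeout argument. You can still set the default timeout globally for all sockets using: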

import socket
import urllib.request

# timeout in seconds
timeout = 10
socket.setdefaulttimeout(timeout)

# this call to urllib.request.urlopen now uses the default timeout
# we have set in the socket module
req = urllib.request.Request('http://www.voidspace.org.uk')
response = urllib.request.urlopen(req)
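In current Python versions urlopen also accepts a timeout argument, which applies to a single call instead of changing the global default. The helper name fetch below is hypothetical, and a data: URL is used so the sketch runs without a network; for such URLs no socket is involved, so the timeout has no effect there.

```python
import urllib.request


def fetch(url, timeout=10.0):
    """Fetch url; timeout (in seconds) applies to blocking socket operations."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# A data: URL is decoded locally, so this call needs no network access.
result = fetch('data:text/plain,hello', timeout=5)
print(result)
```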

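Footnotes¶

This document was reviewed and revised by John Lee.

1: Google for example.
2: Browser sniffing is a very bad practice for website design - building sites using web standards is much more sensible. Unfortunately a lot of sites still send different versions to different browsers.
3: The user agent for MSIE 6 is 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'
4: For details of more HTTP request headers, see Quick Reference to HTTP Headers.
5: In my case I have to use a proxy to access the internet at work. If you attempt to fetch localhost URLs through this proxy it blocks them. IE is set to use the proxy, which urllib picks up on. In order to test scripts with a localhost server, I have to prevent urllib from using the proxy.
6: urllib opener for SSL proxy (CONNECT method): ASPN Cookbook Recipe.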