>>> from env_helper import info; info()

Page last updated: 2024-01-20 22:01:19
Runtime environment:
    Linux distribution: Debian GNU/Linux 12 (bookworm)
    OS kernel: Linux-6.1.0-17-amd64-x86_64-with-glibc2.36
    Python version: 3.11.2

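5.4. Fetching and Parsing Web Page Information

In the era of big data, information already pervades every sector of society. Big data refers to data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. They are massive, fast-growing, and diverse information assets that need new processing models to deliver stronger decision-making power, insight, and process optimization. Processing web page information is therefore a natural requirement.

The previous sections covered the basic structure of HTML. Now we use Python to read the actual content of an HTML page. Python has many libraries for working with web pages, each of them quite capable. Here is a brief overview: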

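requests - A high-level wrapper around the HTTP protocol with rich support for accessing URLs.
PySpider - A powerful web crawler system written by a Chinese developer, with a full-featured WebUI.
bs4 - The beautifulsoup4 library, for parsing and processing HTML and XML.
Scrapy - A powerful crawling framework for crawling websites and extracting structured data from their pages. It serves purposes ranging from data mining to monitoring and automated testing.
Crawley - Crawls website content at high speed, supports relational and non-relational databases, and can export data as JSON, XML, and other formats.
Portia - Visual scraping of web page content.
cola - A distributed crawler framework.
newspaper - Extraction of news and articles, plus content analysis.
lxml - A Python parsing library that supports HTML and XML, with XPath queries.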

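5.4.1. Installing requests

Pick a web page you are interested in for testing. Before that, the third-party library must be installed.

The requests library is used to fetch the content of HTML pages.

requests can be installed with pip: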

pip install requests

On Debian / Ubuntu, use:

sudo apt install -y python3-requests

After installation, test it:

>>> import requests

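5.4.2. Main methods of requests

requests.request() Constructs and sends a request; it is the function underlying all of the methods below.
requests.get() Sends a GET request to the specified server.
requests.head() Sends a HEAD request, which retrieves only the response headers.
requests.post() Sends a POST request to submit data to the server.
requests.put() Sends a PUT request to the server.
requests.patch() Sends a PATCH request for a partial modification.
requests.delete() Sends a DELETE request.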
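As a rough illustration of the list above, the convenience functions differ mainly in the HTTP method they send. The sketch below builds the requests with requests.Request and prepare() without sending them, so it needs no network access. The URL is only a placeholder and build_prepared is a hypothetical helper name.

```python
import requests

# Placeholder URL for illustration; no request is actually sent.
URL = "https://httpbin.org/anything"

METHODS = ["GET", "HEAD", "POST", "PUT", "PATCH", "DELETE"]


def build_prepared(method, url=URL):
    """Build (but do not send) a request for the given HTTP method."""
    return requests.Request(method, url).prepare()


prepared = {method: build_prepared(method) for method in METHODS}
for method, req in prepared.items():
    # Each prepared request carries the method and the final URL.
    print(req.method, req.url)
```

Sending one of these would be a matter of calling the matching function, for example requests.get(URL), which passes the method name on to requests.request().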

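requests.get()

requests.get(url, params=None, **kwargs)

url: The address of the site to fetch.
params: Extra parameters for the URL, as a dictionary or bytes; optional.
**kwargs: Parameters that control the request:
- data: a dictionary, bytes, or file-like object (a string is also accepted), sent as the body of the request. Unlike params, data is not placed in the URL; it is carried in the request body.
- json: JSON-serializable data, sent as the request body. JSON is very common in HTML- and HTTP-related web development and is one of the most widely used data formats over HTTP.
- cookies: a dictionary or CookieJar holding the cookies to send with the request.
- auth: a tuple that enables HTTP authentication, typically (username, password).
- files: a dictionary used when uploading files to the server.
- timeout: the timeout in seconds. If the server does not respond within this time, a timeout exception is raised.
- proxies: a dictionary that configures the proxy servers to use.
- allow_redirects: a switch that controls whether redirects are followed; True by default.
- stream: a switch that controls when the body is downloaded. It is False by default, which downloads the response body immediately; set it to True to defer the download until the content is accessed.
- verify: a switch that controls whether the server's SSL certificate is verified; True by default.
- cert: the path to a local client-side SSL certificate.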
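To make the difference between params, data, and json more concrete, the following sketch prepares three requests without sending them and prints where each argument ends up. The httpbin.org URLs are placeholders, and the variable names are chosen only for this illustration.

```python
import requests

# Placeholder endpoints; the requests are only prepared, never sent.
get_req = requests.Request(
    "GET", "https://httpbin.org/get", params={"key": "val"}
).prepare()

form_req = requests.Request(
    "POST", "https://httpbin.org/post",
    data={"key1": "value1", "key2": "value2"},
).prepare()

json_req = requests.Request(
    "POST", "https://httpbin.org/post", json={"key1": "value1"}
).prepare()

print(get_req.url)                       # params end up in the query string
print(form_req.body)                     # data is form-encoded into the body
print(form_req.headers["Content-Type"])
print(json_req.body)                     # json is serialized into the body
print(json_req.headers["Content-Type"])
```

The form-encoded body here is 23 characters long, which matches the Content-Length reported by httpbin in the POST example of this section.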

The following example sends a GET request to the server at "https://www.osgeo.cn" and retrieves the content.

>>> url='https://www.osgeo.cn/'
>>> html = requests.get(url, timeout=10)  # timeout is in seconds
>>> demo = html.text

>>> html.status_code

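html is the response object returned by the request. Its five main attributes are:

html.status_code The status of the HTTP request; 200 means the request succeeded.
html.text The response body as a string, i.e. the returned page content.
html.encoding The encoding of the response body as guessed from the HTTP headers.
html.apparent_encoding The encoding inferred from the body itself (a fallback encoding).
html.content The response body as bytes.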
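One way these attributes are commonly combined is sketched below. fetch_text is a hypothetical helper, not part of requests; it assumes that a failed request should simply yield None, and that the body-based guess is preferable when the headers declare no charset (in which case requests may fall back to ISO-8859-1 for text content).

```python
import requests


def fetch_text(url, timeout=10):
    """Return the decoded page text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print(f"Timed out after {timeout} s: {url}")
        return None
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None

    # Without a declared charset, prefer the encoding guessed from the body.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.text
```

A call such as fetch_text('https://www.osgeo.cn/') would then return the page content as a string, or None when the request times out or fails.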

requests.head()

Returns the response headers.

>>> requests.head(url).headers

{'Server': 'nginx/1.18.0', 'Date': 'Sat, 20 Jan 2024 14:01:31 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Content-Length': '87', 'Connection': 'keep-alive'}
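To show how HEAD differs from GET without depending on an external site, the sketch below starts a throwaway HTTP server on 127.0.0.1 using the standard library and queries it with both methods. DemoHandler and PAGE are invented for this illustration, and the Session with trust_env disabled is an assumption made so that proxy settings in the environment do not interfere.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests

PAGE = b"<html><body><h1>Hello</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler that serves one fixed HTML page."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_HEAD(self):
        self._send_headers()          # headers only, no body

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)        # headers followed by the body

    def log_message(self, *args):
        pass                          # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

session = requests.Session()
session.trust_env = False             # ignore proxy settings from the environment

head_resp = session.head(url, timeout=5)
get_resp = session.get(url, timeout=5)
server.shutdown()
server.server_close()

print(head_resp.status_code, head_resp.headers["Content-Type"], len(head_resp.content))
print(get_resp.status_code, len(get_resp.content))
```

Both responses carry the same headers, but only the GET response has a body, which is why HEAD is a cheap way to inspect a resource before downloading it.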

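requests.post()

requests.post() submits data to the server. When you build a URL by hand, data for the query string is placed in the URL as key/value pairs after a question mark, for example httpbin.org/get?key=val. With POST, form data is instead passed as a dictionary through the data argument and travels in the request body: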

>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.post("http://httpbin.org/post", data=payload)
>>> r.text

'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "key1": "value1", \n    "key2": "value2"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Content-Length": "23", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.28.1", \n    "X-Amzn-Trace-Id": "Root=1-65abd23d-19403c074bfb7cf32f0f1b5c"\n  }, \n  "json": null, \n  "origin": "47.93.236.110", \n  "url": "http://httpbin.org/post"\n}\n'

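Binary response content

Requests also exposes the response body as raw bytes through r.content, which is useful for downloading non-text resources such as images and videos.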

>>> r.content

b'{\n  "args": {}, \n  "data": "", \n  "files": {}, \n  "form": {\n    "key1": "value1", \n    "key2": "value2"\n  }, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate, br", \n    "Content-Length": "23", \n    "Content-Type": "application/x-www-form-urlencoded", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.28.1", \n    "X-Amzn-Trace-Id": "Root=1-65abd23d-19403c074bfb7cf32f0f1b5c"\n  }, \n  "json": null, \n  "origin": "47.93.236.110", \n  "url": "http://httpbin.org/post"\n}\n'

JSON response content

Requests also has a built-in JSON decoder to help you work with JSON data:

>>> r.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'key1': 'value1', 'key2': 'value2'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate, br',
  'Content-Length': '23',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.28.1',
  'X-Amzn-Trace-Id': 'Root=1-65abd23d-19403c074bfb7cf32f0f1b5c'},
 'json': None,
 'origin': '47.93.236.110',
 'url': 'http://httpbin.org/post'}
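The relationship between r.content, r.text, and r.json() can also be tried without reaching httpbin.org. The sketch below runs a small local server that echoes posted form fields back as JSON. EchoHandler is a hypothetical stand-in written for this example and imitates only the "form" part of httpbin's reply.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qsl

import requests


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for httpbin: echoes posted form fields as JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = dict(parse_qsl(self.rfile.read(length).decode("utf-8")))
        body = json.dumps({"form": form}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment
payload = {"key1": "value1", "key2": "value2"}
r = session.post(
    f"http://127.0.0.1:{server.server_port}/post", data=payload, timeout=5
)
server.shutdown()
server.server_close()

print(type(r.content).__name__)  # the body as raw bytes
print(type(r.text).__name__)     # the same body decoded to str
print(r.json()["form"])          # the body parsed as JSON
```

All three attributes describe the same response body; they differ only in how far it has been decoded.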

Raw response content

In the rare case that you want the raw socket response from the server, you can access r.raw. If you do, make sure you set stream=True in the initial request. For example:

>>> r = requests.get('https://api.github.com/events', stream=True)
>>> r.raw

<urllib3.response.HTTPResponse at 0x7fa5a41d7a60>

>>> r.raw.read(10)

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'
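The first bytes returned by r.raw.read(10) above are 1f 8b 08 in hexadecimal, which is the start of a gzip header: r.raw exposes the body as it arrived on the wire, before Requests decompresses it. The standard-library sketch below reproduces that header with an invented sample payload, assuming the server used gzip compression as the output suggests.

```python
import gzip

# Invented sample payload; any bytes would do.
original = b'{"message": "hello"}'
compressed = gzip.compress(original)

# Every gzip stream starts with the magic bytes 1f 8b, followed by
# 08 for the DEFLATE compression method.
print(compressed[:3])

# Decompressing recovers the original payload.
print(gzip.decompress(compressed) == original)
```

In everyday use r.content and r.text are more convenient than r.raw, because Requests has already decompressed the body by the time they are read.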
