选择器¶

当你抓取网页时，你需要执行的最常见的任务是从HTML源代码中提取数据。有几个库可以实现这一点，例如：

BeautifulSoup 在Python程序员中是一个非常流行的Web抓取库，它基于HTML代码的结构构造了一个Python对象，并且能够很好地处理错误的标记，但是它有一个缺点：速度慢。
lxml 是一个XML解析库（它也解析HTML），使用基于 ElementTree . （LXML不是Python标准库的一部分。）

Scrapy有自己的数据提取机制。它们被称为选择器，因为它们“选择”HTML文档的某些部分 XPath 或 CSS 表达。

XPath 是一种在XML文档中选择节点的语言，也可以与HTML一起使用。 CSS 是用于将样式应用于HTML文档的语言。它定义选择器，将这些样式与特定的HTML元素相关联。

注解

碎屑选择器是一个很薄的包装 parsel library；这个包装器的目的是提供更好的与slapy响应对象的集成。

parsel 是一个独立的网页爬取库，可以使用没有 Scrapy 。它使用 lxml 库位于引擎盖下，并在LXML API之上实现一个简单的API。这意味着scrapy选择器在速度和解析精度方面与lxml非常相似。

使用选择器¶

构造选择器¶

响应对象公开一个 Selector 实例位于 .selector 属性：

>>> response.selector.xpath('//span/text()').get()
'good'

使用xpath和css查询响应非常常见，因此响应中还包含两个快捷方式： response.xpath() 和 response.css() ：

>>> response.xpath('//span/text()').get()
'good'
>>> response.css('span::text').get()
'good'

粗糙的选择器是 Selector 通过传递以下任一参数构造的类 TextResponse 字符串形式的对象或标记(在 text 参数)。

通常不需要手动构造垃圾选择器： response 对象在spider回调中可用，因此在大多数情况下使用它更方便 response.css() 和 response.xpath() 捷径。通过使用 response.selector 或者这些快捷方式之一，您还可以确保响应主体只解析一次。

但如果需要，可以使用 Selector 直接。从文本构建：

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

从响应构造- HtmlResponse 是其中之一 TextResponse 子类：

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url='http://example.com', body=body)
>>> Selector(response=response).xpath('//span/text()').get()
'good'

Selector 根据输入类型自动选择最佳的解析规则（XML对HTML）。

使用选择器¶

为了解释如何使用选择器，我们将使用 Scrapy shell （提供交互式测试）和位于Scrapy文档服务器中的示例页面：

https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

为了完整起见，下面是完整的HTML代码：

<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

首先，让我们打开外壳：

scrapy shell https://docs.scrapy.org/en/latest/_static/selectors-sample1.html

然后，在shell加载之后，您将得到可用的响应 response shell变量及其附加的选择器 response.selector 属性。

由于我们处理的是HTML，选择器将自动使用HTML解析器。

所以，通过观察 HTML code 在该页面中，让我们构造一个XPath来选择title标记内的文本：

>>> response.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]

要实际提取文本数据，必须调用选择器 .get() 或 .getall() 方法如下：

>>> response.xpath('//title/text()').getall()
['Example website']
>>> response.xpath('//title/text()').get()
'Example website'

.get() 始终返回单个结果；如果有多个匹配项，则返回第一个匹配项的内容；如果没有匹配项，则不返回任何匹配项。 .getall() 返回包含所有结果的列表。

请注意，CSS选择器可以使用CSS3伪元素选择文本或属性节点：

>>> response.css('title::text').get()
'Example website'

正如你所看到的， .xpath() 和 .css() 方法返回 SelectorList 实例，它是新选择器的列表。此API可用于快速选择嵌套数据：

>>> response.css('img').xpath('@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

如果只提取第一个匹配的元素，则可以调用选择器 .get() （或其别名） .extract_first() 常用于旧版本）：

>>> response.xpath('//div[@id="images"]/a/text()').get()
'Name: My image 1 '

它返回 None 如果找不到元素：

>>> response.xpath('//div[@id="not-exists"]/text()').get() is None
True

可以将默认返回值作为参数提供，以代替 None ：

>>> response.xpath('//div[@id="not-exists"]/text()').get(default='not-found')
'not-found'

代替使用例如 '@src' XPath可以使用以下命令查询属性 .attrib 对象的属性 Selector ：

>>> [img.attrib['src'] for img in response.css('img')]
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

作为捷径， .attrib 也可以直接在SelectorList上使用；它返回第一个匹配元素的属性：

>>> response.css('img').attrib['src']
'image1_thumb.jpg'

当只需要一个结果时（例如，按id选择，或在网页上选择唯一元素时），此选项最有用：

>>> response.css('base').attrib['href']
'http://example.com/'

现在我们将获得基本URL和一些图像链接：

>>> response.xpath('//base/@href').get()
'http://example.com/'

>>> response.css('base::attr(href)').get()
'http://example.com/'

>>> response.css('base').attrib['href']
'http://example.com/'

>>> response.xpath('//a[contains(@href, "image")]/@href').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

>>> response.css('a[href*=image]::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').getall()
['image1_thumb.jpg',
 'image2_thumb.jpg',
 'image3_thumb.jpg',
 'image4_thumb.jpg',
 'image5_thumb.jpg']

CSS选择器的扩展¶

根据W3C标准， CSS selectors 不支持选择文本节点或属性值。但是在Web抓取上下文中选择这些是非常重要的，以至于scrappy（parsel）实现了 non-standard pseudo-elements ：

要选择文本节点，请使用 ::text
要选择属性值，请使用 ::attr(name) 在哪里？ name 是要为其值的属性的名称

警告

这些伪元素是特定于scrapy-/parsel的。他们很可能不会与其他类库合作 lxml 或 PyQuery .

实例：

title::text 选择子代的子文本节点 <title> 元素：

>>> response.css('title::text').get()
'Example website'

*::text 选择当前选择器上下文的所有子代文本节点：

>>> response.css('#images *::text').getall()
['\n   ',
 'Name: My image 1 ',
 '\n   ',
 'Name: My image 2 ',
 '\n   ',
 'Name: My image 3 ',
 '\n   ',
 'Name: My image 4 ',
 '\n   ',
 'Name: My image 5 ',
 '\n  ']

foo::text 如果 foo 元素存在，但不包含文本（即文本为空）：

>>> response.css('img::text').getall()
[]

这意味着 .css('foo::text').get() 即使元素存在，也无法返回“无”。使用 default='' 如果您总是想要字符串：

>>> response.css('img::text').get()
>>> response.css('img::text').get(default='')
''

a::attr(href) 选择 href 子链接的属性值：

>>> response.css('a::attr(href)').getall()
['image1.html',
 'image2.html',
 'image3.html',
 'image4.html',
 'image5.html']

注解

参见：选择元素属性 .

注解

不能链接这些伪元素。但在实践中，这没有多大意义：文本节点没有属性，属性值已经是字符串值，也没有子节点。

嵌套选择器¶

选择方法 (.xpath() 或 .css() )返回同一类型选择器的列表，以便您也可以调用这些选择器的选择方法。下面是一个例子：

>>> links = response.xpath('//a[contains(@href, "image")]')
>>> links.getall()
['<a href="image1.html">Name: My image 1 <br><img src="image1_thumb.jpg"></a>',
 '<a href="image2.html">Name: My image 2 <br><img src="image2_thumb.jpg"></a>',
 '<a href="image3.html">Name: My image 3 <br><img src="image3_thumb.jpg"></a>',
 '<a href="image4.html">Name: My image 4 <br><img src="image4_thumb.jpg"></a>',
 '<a href="image5.html">Name: My image 5 <br><img src="image5_thumb.jpg"></a>']

>>> for index, link in enumerate(links):
...     href_xpath = link.xpath('@href').get()
...     img_xpath = link.xpath('img/@src').get()
...     print(f'Link number {index} points to url {href_xpath!r} and image {img_xpath!r}')
Link number 0 points to url 'image1.html' and image 'image1_thumb.jpg'
Link number 1 points to url 'image2.html' and image 'image2_thumb.jpg'
Link number 2 points to url 'image3.html' and image 'image3_thumb.jpg'
Link number 3 points to url 'image4.html' and image 'image4_thumb.jpg'
Link number 4 points to url 'image5.html' and image 'image5_thumb.jpg'

选择元素属性¶

有几种方法可以获得属性的值。首先，可以使用XPath语法：

>>> response.xpath("//a/@href").getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

xpath语法有几个优点：它是标准的xpath特性，并且 @attributes 可用于xpath表达式的其他部分-例如，可以按属性值筛选。

scrapy还提供了对css选择器的扩展 (::attr(...) )它允许获取属性值：

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

除此之外，还有 .attrib 选择器的属性。如果您喜欢在Python代码中查找属性，而不使用xpath或CSS扩展，则可以使用它：

>>> [a.attrib['href'] for a in response.css('a')]
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

此属性在SelectorList上也可用；它返回一个字典，其中包含第一个匹配元素的属性。当选择器预期给出单个结果时（例如，当按元素ID选择时，或在页面上选择唯一元素时），使用它非常方便：

>>> response.css('base').attrib
{'href': 'http://example.com/'}
>>> response.css('base').attrib['href']
'http://example.com/'

.attrib 空SelectorList的属性为空：

>>> response.css('foo').attrib
{}

将选择器与正则表达式一起使用¶

Selector 也有一个 .re() 使用正则表达式提取数据的方法。但是，与使用 .xpath() 或 .css() 方法： .re() 返回字符串列表。所以您不能构造嵌套的 .re() 电话。

下面是一个用于从 HTML code 以上：

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
['My image 1',
 'My image 2',
 'My image 3',
 'My image 4',
 'My image 5']

另外还有一个助手在做往复运动 .get() （及其别名） .extract_first() 为 .re() 命名 .re_first() . 使用它只提取第一个匹配字符串：

>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')
'My image 1'

extract（）和extract_first（）。¶

如果你是一个长期的垃圾用户，你可能熟悉 .extract() 和 .extract_first() 选择器方法。许多博客文章和教程也在使用它们。这些方法仍然由Scrapy支持，有 没有计划 去贬低他们。

但是，现在使用 .get() 和 .getall() 方法。我们认为这些新方法会产生更简洁和可读的代码。

下面的例子展示了这些方法如何相互映射。

SelectorList.get() 是一样的 SelectorList.extract_first() ：

>>> response.css('a::attr(href)').get()
'image1.html'
>>> response.css('a::attr(href)').extract_first()
'image1.html'

SelectorList.getall() 是一样的 SelectorList.extract() ：

>>> response.css('a::attr(href)').getall()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']
>>> response.css('a::attr(href)').extract()
['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Selector.get() 是一样的 Selector.extract() ：

>>> response.css('a::attr(href)')[0].get()
'image1.html'
>>> response.css('a::attr(href)')[0].extract()
'image1.html'

为了保持一致性，还有 Selector.getall() ，返回一个列表：
```
>>> response.css('a::attr(href)')[0].getall()
['image1.html']
```

所以，主要的区别在于 .get() 和 .getall() 方法更容易预测： .get() 总是返回单个结果， .getall() 始终返回所有提取结果的列表。用 .extract() 方法：结果是否为列表并不总是显而易见的；或者得到一个单独的结果 .extract() 或 .extract_first() 应该被调用。

使用xpaths¶

下面是一些提示，可以帮助您有效地将xpath与scrapy选择器结合使用。如果您还不太熟悉xpath，可以先看看这个 XPath tutorial .

注解

其中一些技巧是基于 this post from Zyte's blog 。

使用相对路径¶

请记住，如果要嵌套选择器并使用以开头的xpath / ，该xpath对文档是绝对的，而不是相对于 Selector 你是从打电话来的。

例如，假设您希望提取所有 <p> 内部元素 <div> 元素。首先，你会得到所有 <div> 元素：

>>> divs = response.xpath('//div')

首先，您可能会尝试使用以下方法，这是错误的，因为它实际上提取了所有 <p> 文档中的元素，而不仅仅是其中的元素 <div> 元素：

>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.get())

这是正确的方法（注意在 .//p XPath）：

>>> for p in divs.xpath('.//p'):  # extracts all <p> inside
...     print(p.get())

另一个常见的情况是提取所有直接 <p> 儿童：

>>> for p in divs.xpath('p'):
...     print(p.get())

有关相对路径的更多详细信息，请参见 Location Paths XPath规范中的节。

按类查询时，请考虑使用CSS¶

因为一个元素可以包含多个CSS类，所以按类选择元素的xpath方法相当冗长：

*[contains(concat(' ', normalize-space(@class), ' '), ' someclass ')]

如果你使用 @class='someclass' 如果只使用 contains(@class, 'someclass') 为了弥补这一点，如果元素具有共享字符串的不同类名，那么最终可能会得到更多想要的元素。 someclass .

事实证明，剪贴选择器允许您链接选择器，因此大多数时候，您都可以使用CSS按类选择，然后在需要时切换到XPath：

>>> from scrapy import Selector
>>> sel = Selector(text='<div class="hero shout"><time datetime="2014-07-23 19:00">Special date</time></div>')
>>> sel.css('.shout').xpath('./time/@datetime').getall()
['2014-07-23 19:00']

这比使用上面显示的详细的xpath技巧要干净。只要记住使用 . 在后面的xpath表达式中。

注意//节点之间的区别 [1] 和（/ /节点） [1]¶

//node[1] 选择所有首先出现在各自父节点下的节点。

(//node)[1] 选择文档中的所有节点，然后只获取其中的第一个节点。

例子：

>>> from scrapy import Selector
>>> sel = Selector(text="""
....:     <ul class="list">
....:         <li>1</li>
....:         <li>2</li>
....:         <li>3</li>
....:     </ul>
....:     <ul class="list">
....:         <li>4</li>
....:         <li>5</li>
....:         <li>6</li>
....:     </ul>""")
>>> xp = lambda x: sel.xpath(x).getall()

这是最重要的 <li> 任何父元素下的元素：

>>> xp("//li[1]")
['<li>1</li>', '<li>4</li>']

这是第一个 <li> 整个文档中的元素：

>>> xp("(//li)[1]")
['<li>1</li>']

这是最重要的 <li> 下的元素 <ul> 起源：

>>> xp("//ul/li[1]")
['<li>1</li>', '<li>4</li>']

这是第一个 <li> 元素在 <ul> 整个文档中的父级：

>>> xp("(//ul/li)[1]")
['<li>1</li>']

在条件中使用文本节点¶

当需要将文本内容用作 XPath string function 避免使用 .//text() and use just . 相反。

这是因为表达式 .//text() 生成一个文本元素集合--a node-set . 当一个节点集被转换成一个字符串时，当它作为参数传递给一个字符串函数（如 contains() 或 starts-with() ，它只为第一个元素生成文本。

例子：

>>> from scrapy import Selector
>>> sel = Selector(text='<a href="#">Click here to go to the <strong>Next Page</strong></a>')

转换A node-set 字符串：

>>> sel.xpath('//a//text()').getall() # take a peek at the node-set
['Click here to go to the ', 'Next Page']
>>> sel.xpath("string(//a[1]//text())").getall() # convert it to string
['Click here to go to the ']

A node 但是，转换为字符串后，会将其自身的文本加上其所有后代的文本组合在一起：

>>> sel.xpath("//a[1]").getall() # select the first node
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']
>>> sel.xpath("string(//a[1])").getall() # convert it to string
['Click here to go to the Next Page']

所以，使用 .//text() 在这种情况下，节点集不会选择任何内容：

>>> sel.xpath("//a[contains(.//text(), 'Next Page')]").getall()
[]

但是使用 . 指的是节点：

>>> sel.xpath("//a[contains(., 'Next Page')]").getall()
['<a href="#">Click here to go to the <strong>Next Page</strong></a>']

xpath表达式中的变量¶

xpath允许您引用xpath表达式中的变量，使用 $somevariable 语法。这与SQL世界中的参数化查询或准备好的语句有点类似，在SQL世界中，将查询中的某些参数替换为诸如 ? ，然后用查询传递的值替换。

下面是一个基于元素“id”属性值匹配元素的示例，而不必对其进行硬编码（如前所示）：

>>> # `$val` used in the expression, a `val` argument needs to be passed
>>> response.xpath('//div[@id=$val]/a/text()', val='images').get()
'Name: My image 1 '

下面是另一个示例，用于查找 <div> 包含五个的标签 <a> 孩子们（在这里我们传递价值 5 整数形式）：

>>> response.xpath('//div[count(a)=$cnt]/@id', cnt=5).get()
'images'

调用时，所有变量引用都必须具有绑定值 .xpath() （否则你会得到 ValueError: XPath error: 例外）。这是通过根据需要传递尽可能多的命名参数来实现的。

parsel _为scrappy选择器供电的库提供了更多的细节和示例 XPath variables .

正在删除命名空间¶

在处理爬取项目时，完全消除名称空间，只使用元素名，编写更简单/方便的xpaths通常是非常方便的。你可以使用 Selector.remove_namespaces() 方法。

让我们展示一个例子，用Python Insider博客Atom feed来说明这一点。

首先，我们用要抓取的URL打开外壳：

$ scrapy shell https://feeds.feedburner.com/PythonInsider

文件就是这样开始的：

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet ...
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:openSearch="http://a9.com/-/spec/opensearchrss/1.0/"
      xmlns:blogger="http://schemas.google.com/blogger/2008"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005"
      xmlns:thr="http://purl.org/syndication/thread/1.0"
      xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  ...

您可以看到几个名称空间声明，其中包括默认的“http://www.w3.org/2005/atom”，以及另一个使用“gd:”前缀的“http://schemas.google.com/g/2005”。

一旦进入外壳，我们可以尝试选择所有 <link> 对象并查看它是否不起作用（因为Atom XML命名空间混淆了这些节点）：

>>> response.xpath("//link")
[]

但一旦我们呼叫 Selector.remove_namespaces() 方法，则所有节点都可以通过其名称直接访问：

>>> response.selector.remove_namespaces()
>>> response.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html" h'>,
    <Selector xpath='//link' data='<link rel="next" type="application/atom+'>,
    ...

如果您想知道为什么不总是在默认情况下调用命名空间移除过程，而不必手动调用它，这是因为两个原因，按照相关性的顺序，这是：

删除命名空间需要迭代和修改文档中的所有节点，这是一个相当昂贵的操作，默认情况下，对scrapy所爬行的所有文档执行此操作。
在某些情况下，实际需要使用名称空间，以防某些元素名称在名称空间之间发生冲突。不过，这些病例非常罕见。

使用exslt扩展¶

建在顶上 lxml ，scrapy选择器支持一些 EXSLT 扩展名，并附带这些预注册的命名空间以在xpath表达式中使用：

前缀	命名空间	使用
重新	http://exslt.org/正则表达式	regular expressions
设置	http://exslt.org/sets	set manipulation

正则表达式¶

这个 test() 例如，当xpath的 starts-with() 或 contains() 还不够。

在列表项中选择“class”属性以数字结尾的链接的示例：

>>> from scrapy import Selector
>>> doc = """
... <div>
...     <ul>
...         <li class="item-0"><a href="link1.html">first item</a></li>
...         <li class="item-1"><a href="link2.html">second item</a></li>
...         <li class="item-inactive"><a href="link3.html">third item</a></li>
...         <li class="item-1"><a href="link4.html">fourth item</a></li>
...         <li class="item-0"><a href="link5.html">fifth item</a></li>
...     </ul>
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> sel.xpath('//li//@href').getall()
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
>>> sel.xpath('//li[re:test(@class, "item-\d$")]//@href').getall()
['link1.html', 'link2.html', 'link4.html', 'link5.html']

警告

C库 libxslt 本机不支持exslt正则表达式，因此 lxml 的实现使用了对python的钩子 re 模块。因此，在xpath表达式中使用regexp函数可能会增加一点性能损失。

集合运算¶

例如，在提取文本元素之前，可以方便地排除文档树的某些部分。

示例提取微数据（样本内容取自https://schema.org/Product)包含一组itemscope和相应的itemprop:：

>>> doc = """
... <div itemscope itemtype="http://schema.org/Product">
...   <span itemprop="name">Kenmore White 17" Microwave</span>
...   <img src="kenmore-microwave-17in.jpg" alt='Kenmore 17" Microwave' />
...   <div itemprop="aggregateRating"
...     itemscope itemtype="http://schema.org/AggregateRating">
...    Rated <span itemprop="ratingValue">3.5</span>/5
...    based on <span itemprop="reviewCount">11</span> customer reviews
...   </div>
...
...   <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
...     <span itemprop="price">$55.00</span>
...     <link itemprop="availability" href="http://schema.org/InStock" />In stock
...   </div>
...
...   Product description:
...   <span itemprop="description">0.7 cubic feet countertop microwave.
...   Has six preset cooking categories and convenience features like
...   Add-A-Minute and Child Lock.</span>
...
...   Customer reviews:
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Not a happy camper</span> -
...     by <span itemprop="author">Ellie</span>,
...     <meta itemprop="datePublished" content="2011-04-01">April 1, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1">
...       <span itemprop="ratingValue">1</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">The lamp burned out and now I have to replace
...     it. </span>
...   </div>
...
...   <div itemprop="review" itemscope itemtype="http://schema.org/Review">
...     <span itemprop="name">Value purchase</span> -
...     by <span itemprop="author">Lucas</span>,
...     <meta itemprop="datePublished" content="2011-03-25">March 25, 2011
...     <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
...       <meta itemprop="worstRating" content = "1"/>
...       <span itemprop="ratingValue">4</span>/
...       <span itemprop="bestRating">5</span>stars
...     </div>
...     <span itemprop="description">Great microwave for the price. It is small and
...     fits in my apartment.</span>
...   </div>
...   ...
... </div>
... """
>>> sel = Selector(text=doc, type="html")
>>> for scope in sel.xpath('//div[@itemscope]'):
...     print("current scope:", scope.xpath('@itemtype').getall())
...     props = scope.xpath('''
...                 set:difference(./descendant::*/@itemprop,
...                                .//*[@itemscope]/*/@itemprop)''')
...     print(f"    properties: {props.getall()}")
...     print("")

current scope: ['http://schema.org/Product']
    properties: ['name', 'aggregateRating', 'offers', 'description', 'review', 'review']

current scope: ['http://schema.org/AggregateRating']
    properties: ['ratingValue', 'reviewCount']

current scope: ['http://schema.org/Offer']
    properties: ['price', 'availability']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

current scope: ['http://schema.org/Review']
    properties: ['name', 'author', 'datePublished', 'reviewRating', 'description']

current scope: ['http://schema.org/Rating']
    properties: ['worstRating', 'ratingValue', 'bestRating']

在这里我们首先迭代 itemscope 元素，每一个元素，我们都在寻找 itemprops 元素并排除它们本身在另一个元素中的元素 itemscope .

其他XPath扩展¶

scrapy选择器还提供一个非常遗漏的xpath扩展函数 has-class 它会回来 True 对于具有所有指定HTML类的节点。

对于以下HTML:：

<p class="foo bar-baz">First</p>
<p class="foo">Second</p>
<p class="bar">Third</p>
<p>Fourth</p>

你可以这样使用它：

>>> response.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
 <Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> response.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector xpath='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> response.xpath('//p[has-class("foo", "bar")]')
[]

所以XPath //p[has-class("foo", "bar-baz")] 大致相当于CSS p.foo.bar-baz . 请注意，在大多数情况下，它的速度较慢，因为它是一个纯Python函数，可以为问题中的每个节点调用，而CSS查找被转换为xpath，因此运行效率更高，因此性能方面，它的使用仅限于不容易用css选择器描述的情况。

Parsel还简化了添加自己的xpath扩展。

parsel.xpathfuncs.set_xpathfunc(fname, func)[源代码]¶

注册要在xpath表达式中使用的自定义扩展函数。

函数 func 注册于 fname 将为每个匹配节点调用标识符，并将其传递给 context 参数以及从相应的xpath表达式传递的任何参数。

如果 func 是 None ，将删除扩展功能。

查看更多 in lxml documentation .

内置选择器引用¶

选择器对象¶

class scrapy.selector.Selector(*args, **kwargs)[源代码]¶

的实例 Selector 是一个包装响应，用于选择其内容的某些部分。

response 是一个 HtmlResponse 或 XmlResponse 将用于选择和提取数据的对象。

text 是Unicode字符串或UTF-8编码文本，用于 response 不可用。使用 text 和 response 一起是未定义的行为。

type 定义选择器类型，它可以是 "html" ， "xml" 或 None （默认）。

如果 type 是 None ，选择器根据 response 类型（见下文），或默认为 "html" 以防与 text .

如果 type 是 None 和A response 如果传递，则从响应类型推断选择器类型，如下所示：

"html" 对于 HtmlResponse 类型
"xml" 对于 XmlResponse 类型
"html" 还有什么事吗

否则，如果 type 设置后，选择器类型将被强制，不会发生检测。

xpath(query, namespaces=None, **kwargs)[源代码]¶

查找与xpath匹配的节点 query 并将结果作为 SelectorList 将所有元素展平的实例。列表元素实现 Selector 接口也是如此。

query 是包含要应用的XPath查询的字符串。

namespaces 是可选的 prefix: namespace-uri 将附加前缀的映射（dict）映射到 register_namespace(prefix, uri) . 相反 register_namespace() ，这些前缀不会保存以备将来调用。

可以使用任何其他命名参数来传递xpath表达式中xpath变量的值，例如：：

selector.xpath('//a[href=$url]', url="http://www.example.com")

注解

为了方便起见，此方法可以调用为 response.xpath()

css(query)[源代码]¶

应用给定的css选择器并返回 SelectorList 实例。

query 是包含要应用的CSS选择器的字符串。

在后台，使用 cssselect 类库与运行 .xpath() 方法。

注解

为了方便起见，此方法可以调用为 response.css()

get()[源代码]¶

序列化并返回单个Unicode字符串中匹配的节点。未引用编码内容的百分比。

参见： extract（）和extract_first（）。

attrib¶

返回基础元素的属性字典。

参见：选择元素属性 .

re(regex, replace_entities=True)[源代码]¶

应用给定的regex并返回带有匹配项的Unicode字符串列表。

regex 可以是已编译的正则表达式，也可以是将使用 re.compile(regex) .

默认情况下，字符实体引用替换为其相应的字符（除了 & 和 < ）经过 replace_entities 作为 False 关闭这些替换。

re_first(regex, default=None, replace_entities=True)[源代码]¶

应用给定的regex并返回第一个匹配的unicode字符串。如果没有匹配项，则返回默认值 (None 如果未提供参数）。

默认情况下，字符实体引用替换为其相应的字符（除了 & 和 < ）经过 replace_entities 作为 False 关闭这些替换。

register_namespace(prefix, uri)[源代码]¶: 注册要在此中使用的给定命名空间 Selector . 如果不注册命名空间，则无法从非标准命名空间中选择或提取数据。见 XML响应的选择器示例 .

remove_namespaces()[源代码]¶: 删除所有名称空间，允许使用不含名称空间的xpaths遍历文档。见正在删除命名空间 .

__bool__()[源代码]¶: 返回 True 如果选择了任何真实内容或 False 否则。换句话说，布尔值 Selector 由它选择的内容给出。

getall()[源代码]¶

序列化并返回unicode字符串的1元素列表中匹配的节点。

为了保持一致性，这个方法被添加到选择器中；它对于选择器列表更有用。参见： extract（）和extract_first（）。

SelectorList对象¶

class scrapy.selector.SelectorList(iterable=(), /)[源代码]¶

这个 SelectorList 类是内置的子类 list 类，它提供了一些附加方法。

xpath(xpath, namespaces=None, **kwargs)[源代码]¶

打电话给 .xpath() 此列表中的每个元素的方法，并将其结果扁平化为另一个 SelectorList .

query is the same argument as the one in Selector.xpath()

namespaces 是可选的 prefix: namespace-uri 将附加前缀的映射（dict）映射到 register_namespace(prefix, uri) . 相反 register_namespace() ，这些前缀不会保存以备将来调用。

可以使用任何其他命名参数来传递xpath表达式中xpath变量的值，例如：：

selector.xpath('//a[href=$url]', url="http://www.example.com")

css(query)[源代码]¶

打电话给 .css() 此列表中的每个元素的方法，并将其结果扁平化为另一个 SelectorList .

query is the same argument as the one in Selector.css()

getall()[源代码]¶

打电话给 .get() 每个元素的方法都是这个列表，并将它们的结果作为一个Unicode字符串列表平展地返回。

参见： extract（）和extract_first（）。

get(default=None)[源代码]¶

返回的结果 .get() 对于此列表中的第一个元素。如果列表为空，则返回默认值。

参见： extract（）和extract_first（）。

re(regex, replace_entities=True)[源代码]¶

打电话给 .re() 方法，并以unicode字符串列表的形式返回结果。

默认情况下，字符实体引用替换为其相应的字符（除了 & 和 < .经过 replace_entities 作为 False 关闭这些替换。

re_first(regex, default=None, replace_entities=True)[源代码]¶

打电话给 .re() 方法，并以Unicode字符串返回结果。如果列表为空或regex不匹配，则返回默认值 (None 如果未提供参数）。

默认情况下，字符实体引用替换为其相应的字符（除了 & 和 < .经过 replace_entities 作为 False 关闭这些替换。

attrib¶

返回第一个元素的属性字典。如果列表为空，则返回空的dict。

参见：选择元素属性 .

实例¶

HTML响应的选择器示例¶

这里有一些 Selector 举例说明几个概念。在所有情况下，我们假设 Selector 用一个 HtmlResponse 这样的对象：

sel = Selector(html_response)

选择全部 <h1> 来自HTML响应正文的元素，返回 Selector 对象（即 SelectorList 对象）：
```
sel.xpath("//h1")
```

提取所有文本 <h1> 元素，返回字符串列表：

sel.xpath("//h1").getall()         # this includes the h1 tag
sel.xpath("//h1/text()").getall()  # this excludes the h1 tag

全部迭代 <p> 标记并打印其类属性：

for node in sel.xpath("//p"):
    print(node.attrib['class'])

XML响应的选择器示例¶

下面是一些例子来说明 Selector 对象用 XmlResponse 对象：

sel = Selector(xml_response)

选择全部 <product> 来自XML响应主体的元素，返回 Selector 对象（即 SelectorList 对象）：
```
sel.xpath("//product")
```

从A中提取所有价格 Google Base XML feed 需要注册命名空间：：

sel.register_namespace("g", "http://base.google.com/ns/1.0")
sel.xpath("//g:price").getall()