Python Web Scraping: Installing and Using the PyQuery Library

2017/1/30

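Preface

Python has plenty of scraping libraries, each with its own strengths. Anyone familiar with front-end work knows that jQuery can use selectors to pinpoint targets in the DOM tree and operate on them, so I figured it would be cool to scrape web pages the jQuery way.

So I searched for a DOM-oriented library for Python, and sure enough there is one: PyQuery!

About PyQuery

pyquery is essentially a Python implementation of jQuery that can be used to parse HTML pages, among other things. Its syntax is almost identical to jQuery's, so anyone who has used jQuery will find it familiar and easy to pick up.

In the author's own words:

"The API is as much as possible the similar to jquery."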

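Installation

Either pip or easy_install works.

Note: pyquery depends on lxml. pip normally installs it automatically as a dependency; if that step fails (for example, because lxml has to be compiled on your platform), install lxml first and then retry.

Install lxml: https://pypi.python.org/pypi/lxml/2.3/ (downloading a prebuilt installer package is the quick and easy route);
Install pyquery: easy_install pyquery or pip install pyquery;
Verify: run import pyquery in a Python interpreter; if it returns without an error, the installation succeeded.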
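
The verification step can also be scripted. The following is a minimal sketch using only the standard library; the helper name check_installed is made up for this illustration and is not part of pyquery or lxml.

```python
import importlib.util


def check_installed(names=("lxml", "pyquery")):
    """Return a dict mapping each module name to True if Python can locate it."""
    # find_spec returns None when a top-level module cannot be found.
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Report the status of both packages without actually importing them.
for name, ok in check_installed().items():
    print(name, "OK" if ok else "missing")
```

If either line prints "missing", install that package and run the check again.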

Initialization

There are 4 ways to initialize PyQuery: from a string, an lxml element, a file, or a URL.

from pyquery import PyQuery as pq
from lxml import etree

d = pq("<html></html>")                      # from a string
d = pq(etree.fromstring("<html></html>"))    # from an lxml element
d = pq(url='http://google.com/')             # from a URL
d = pq(filename=path_to_html_file)           # from a file (path_to_html_file is your own path)

Now d works just like $ in jQuery.

Example

A simple example is the quickest way to get familiar with pyquery. We load the file hello.html, which contains:

<div>
  <tr class="item-0">
    <td>first section</td>
    <td>1111</td>
    <td>17-01-28 22:51</td>
  </tr>
  <tr class="item-1">
    <td>second section</td>
    <td>2222</td>
    <td>17-01-28 22:53</td>
  </tr>
</div>

The Python program:

# -*- coding: utf-8 -*-
from pyquery import PyQuery as pq    # import PyQuery

doc = pq(filename='hello.html')  # load the file hello.html

print(doc.html())                # html() returns the HTML inside the current selection

print(doc('.item-1'))            # class selector: the block whose class is item-1

data = doc('tr')                 # select the <tr> elements

for tr in data.items():          # iterate over the <tr> elements in data
    temp = tr('td').eq(2).text() # text of the third <td> element
    print(temp)

Output:

# doc.html()
    <tr class="item-0">
        <td>first section</td>
        <td>1111</td>
        <td>17-01-28 22:51</td>
    </tr>
    <tr class="item-1">
        <td>second section</td>
        <td>2222</td>
        <td>17-01-28 22:53</td>
    </tr>

# doc('.item-1')
    <tr class="item-1">
        <td>second section</td>
        <td>2222</td>
        <td>17-01-28 22:53</td>
    </tr>

# tr('td').eq(2).text(), first <tr>
17-01-28 22:51
# tr('td').eq(2).text(), second <tr>
17-01-28 22:53

Operations

1. .html() and .text(): get the HTML or the text content of the selection.

p = pq("<head><title>Hello World!</title></head>")

print(p('head').html())            # the HTML inside <head>
print(p('head').text())            # the text content of <head>

'''Output:
<title>Hello World!</title>
Hello World!
'''

2. ('selector'): get the target content through a selector.

d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")

print(d('div').html())             # the HTML inside the <div> element
print(d('#item-0').text())         # text of the element whose id is item-0
print(d('.item-1').text())         # text of the element whose class is item-1

'''Output:
<p id="item-0">test 1</p><p class="item-1">test 2</p>
test 1
test 2
'''

3. .eq(index): get the element at the given index (index starts at 0).

d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")

print(d('p').eq(1).text())         # text of the second p element

'''Output:
test 2
'''

4. .find(): look up nested elements.

d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")

print(d('div').find('p'))          # the p elements inside the <div>
print(d('div').find('p').eq(0))    # the p elements inside the <div>, first one only

'''Output:
<p id="item-0">test 1</p><p class="item-1">test 2</p>
<p id="item-0">test 1</p>
'''

5. .filter(): narrow the selection down by class or id.

d = pq("<div><p id='item-0'>test 1</p><p class='item-1'>test 2</p></div>")

print(d('p').filter('.item-1'))          # the p element whose class is item-1
print(d('p').filter('#item-0'))          # the p element whose id is item-0

'''Output:
<p class="item-1">test 2</p>
<p id="item-0">test 1</p>
'''

6. .attr(): get or set an attribute value.

d = pq("<div><p id='item-0'>test 1</p><a class='item-1'>test 2</a></div>")

print(d('p').attr('id'))                  # get the id attribute of the <p> tag
print(d('a').attr('class','new'))         # set the class attribute of the <a> tag to new

'''Output:
item-0
<a class="new">test 2</a>
'''

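7. Other operations:

.addClass(value): add a class;

.hasClass(name): check whether the element has the given class; returns True or False;

.children(): get the child elements;

.parents(): get the ancestor elements (the parent and everything above it);

.next(): get the next sibling element;

.nextAll(): get all following sibling elements;

.not_('selector'): get all elements that do not match the selector;

for i in d.items('li'): print(i.text()): iterate over the li elements in d.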

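Conclusion

The operations above are basically enough for everyday scraping of small amounts of data. PyQuery has a lot more to offer that I won't cover here; if you want to learn more, take a look at the official documentation.

The official documentation is in English, but it is fairly easy to read and follow. I also found a Chinese tutorial site, linked below.

Official documentation: https://pythonhosted.org/pyquery/index.html#

Chinese tutorial: http://www.geoinformatics.cn/lab/pyquery/

Original article by Go 2 Think. Please credit the source and link to the original when reposting.

Original link: https://go2think.com/python%e7%88%ac%e8%99%ab%ef%bc%9apyquery%e5%ba%93%e7%9a%84%e5%ae%89%e8%a3%85%e4%b8%8e%e4%bd%bf%e7%94%a8/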

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
