Traversing the Document Tree and Operating on Tags with the Python Scraping Library BeautifulSoup

2020-02-27

Source: CSDN blog (https://blog.csdn.net/haoxun05/article/details/104506265)


This article introduces the methods and attributes the Python scraping library BeautifulSoup provides for traversing a document tree and operating on its tags.

The examples below cover the fundamentals of traversing a document tree and operating on tags with BeautifulSoup, all working on the following sample document:


html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')


I. Child Nodes


A Tag may contain multiple strings or other Tags; all of these are children of that Tag. BeautifulSoup provides many attributes for operating on and traversing child nodes.


1. Getting a Tag by its name


print(soup.head)

print(soup.title)


<head><title>The Dormouse's story</title></head>

<title>The Dormouse's story</title>


Accessing a tag by name only returns the first matching Tag. To get all tags of a certain kind, use the find_all method:


soup.find_all('a')


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
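As a quick, self-contained check of this behavior, here is a minimal sketch; it uses the built-in html.parser and a trimmed-down version of the sample document, so the exact markup is illustrative:

```python
from bs4 import BeautifulSoup

# A trimmed-down version of the article's sample document
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# soup.a gives only the first <a>; find_all returns every match
print(soup.a["id"])   # link1
links = soup.find_all("a")
print(len(links))     # 2

# find_all also accepts attribute filters
print(soup.find_all("a", id="link2")[0].get_text())  # Lacie
```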


2. contents: returns the Tag's children as a list


head_tag = soup.head

head_tag.contents


[<title>The Dormouse's story</title>]


title_tag = head_tag.contents[0]

title_tag


<title>The Dormouse's story</title>


title_tag.contents


["The Dormouse's story"]
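One practical difference worth noting: .contents is a real list (indexable and reusable), while .children (covered next) is a one-shot iterator. A minimal sketch, using the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .contents is a plain list, so it can be indexed and reused
print(type(head_tag.contents).__name__)   # list
print(head_tag.contents[0].name)          # title

# .children is an iterator: it is exhausted after one pass
kids = head_tag.children
print([child.name for child in kids])     # ['title']
print(list(kids))                         # [] (already consumed)
```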


3. children: an iterator for looping over the direct children


for child in title_tag.children:

  print(child)


The Dormouse's story


4. descendants: contents and children both return only direct children, while descendants recursively iterates over all descendants of a tag


for child in head_tag.children:

  print(child)


<title>The Dormouse's story</title>


for child in head_tag.descendants:

  print(child)


<title>The Dormouse's story</title>

The Dormouse's story
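The difference is easy to see by counting: .children yields only the <title> tag, while .descendants also yields the string inside it. A minimal sketch using the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# direct children: just the <title> tag
print(len(list(head_tag.children)))      # 1
# descendants: the <title> tag plus its NavigableString child
print(len(list(head_tag.descendants)))   # 2
```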


5. string: if a tag has only one child of type NavigableString, the tag's .string returns that child


title_tag.string


"The Dormouse's story"


If a tag has only a single child tag, then .string returns that child's NavigableString as well:


head_tag.string


"The Dormouse's story"


If a tag has multiple children, .string cannot tell which child's content it should refer to, so it returns None:

print(soup.html.string)


None
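When .string comes back None, a common alternative is get_text(), which concatenates every string in the subtree. A minimal sketch (the markup here is ad hoc):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b> two</p>", "html.parser")

# <p> has two children (<b> and a string), so .string is ambiguous -> None
print(soup.p.string)      # None
# get_text() joins all strings in the subtree instead
print(soup.p.get_text())  # one two
```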


6. strings and stripped_strings


If a tag contains more than one string, you can iterate over them with .strings:


for string in soup.strings:

  print(string)


The Dormouse's story

 

The Dormouse's story

  

Once upon a time there were three little sisters; and their names were

 

Elsie

,


Lacie

 and

 

Tillie

;

and they lived at the bottom of a well.

 

...

The output of .strings contains a lot of whitespace and blank lines; use .stripped_strings to strip out that whitespace:


for string in soup.stripped_strings:

  print(string)


The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie

,

Lacie

and

Tillie

;

and they lived at the bottom of a well.

...
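The same contrast in miniature, with an ad-hoc two-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  one </p><p>\n two </p>", "html.parser")

# .strings keeps the surrounding whitespace exactly as parsed
print(list(soup.strings))           # ['  one ', '\n two ']
# .stripped_strings trims it and skips whitespace-only strings
print(list(soup.stripped_strings))  # ['one', 'two']
```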


II. Parent Nodes


1. parent: gets an element's parent node


title_tag = soup.title

title_tag.parent


<head><title>The Dormouse's story</title></head>


A string also has a parent:


title_tag.string.parent


<title>The Dormouse's story</title>


2. parents: recursively iterates over all ancestor nodes


link = soup.a

for parent in link.parents:

  if parent is None:

    print(parent)

  else:

    print(parent.name)


body

html

[document]
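The loop above can be condensed into a list comprehension, which is handy for building the "path" of a tag. A self-contained sketch with the built-in html.parser and ad-hoc markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")

# collect the name of every ancestor, from the closest outward;
# the BeautifulSoup object itself has the name '[document]'
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```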


III. Sibling Nodes


sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')

print(sibling_soup.prettify())


<html>

 <body>

 <a>

  <b>

  text1

  </b>

  <c>

  text2

  </c>

 </a>

 </body>

</html>


1. next_sibling and previous_sibling


sibling_soup.b.next_sibling


<c>text2</c>


sibling_soup.c.previous_sibling


<b>text1</b>


In a real document, .next_sibling and .previous_sibling are usually strings or whitespace:


soup.find_all('a')


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


soup.a.next_sibling # the first <a>'s next_sibling is the string ',\n'


',\n'

soup.a.next_sibling.next_sibling


<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
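Chaining .next_sibling twice to skip the whitespace string is fragile; find_next_sibling('a') jumps straight to the next tag of interest. A minimal sketch with the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = '<p><a class="sister" id="link1">Elsie</a>,\n<a class="sister" id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# .next_sibling lands on the ",\n" string first
print(repr(soup.a.next_sibling))            # ',\n'
# find_next_sibling("a") skips over it to the next <a> tag
print(soup.a.find_next_sibling("a")["id"])  # link2
```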


2. next_siblings and previous_siblings


for sibling in soup.a.next_siblings:

  print(repr(sibling))


',\n'

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

' and\n'

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

';\nand they lived at the bottom of a well.'


for sibling in soup.find(id="link3").previous_siblings:

  print(repr(sibling))


' and\n'

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

',\n'

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

'Once upon a time there were three little sisters; and their names were\n'


IV. Going Backward and Forward

1. next_element and previous_element


These point to the next or previous parsed object (a string or a tag), that is, the successor and predecessor in a depth-first traversal:


last_a_tag = soup.find("a", id="link3")

print(last_a_tag.next_sibling)

print(last_a_tag.next_element)

;
and they lived at the bottom of a well.
Tillie

last_a_tag.previous_element

' and\n'

2. next_elements and previous_elements


With .next_elements and .previous_elements you can move forward or backward through the document's parsed content, as if the document were being parsed all over again:


for element in last_a_tag.next_elements:

  print(repr(element))


'Tillie'

';\nand they lived at the bottom of a well.'

'\n'

<p class="story">...</p>

'...'

'\n'
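The related find_all_next / find_all_previous methods filter this element stream by tag name, which is often more convenient than iterating manually. A small self-contained sketch with ad-hoc markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p><p>end</p>'
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find(id="link2")

# next_element first descends INTO the tag: it is the tag's own text
print(repr(a_tag.next_element))                          # 'Lacie'
# find_all_next("p") keeps only later <p> tags from the element stream
print([p.get_text() for p in a_tag.find_all_next("p")])  # ['end']
```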



Copyright notice: this article is an original post by the CSDN blogger "python进步学习者", released under the CC 4.0 BY-SA license; reproduction must include a link to the original article and this notice.
