Traversing the Document Tree and Operating on Tags with the Python Scraping Library BeautifulSoup

2020-02-27

Source: CSDN blog (https://blog.csdn.net/haoxun05/article/details/104506265)


This article introduces the methods and attributes the Python scraping library BeautifulSoup provides for traversing a document tree and operating on its tags.

The examples below cover the fundamentals of traversing a document tree and operating on tags with BeautifulSoup, all working on the following sample document:


html_doc = """
<html><head><title>The Dormouse's story</title></head>

<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'lxml')


I. Child Nodes


A Tag may contain multiple strings or other Tags; all of these are children of that Tag. BeautifulSoup provides many attributes for operating on and traversing child nodes.


1. Getting a Tag by its name


print(soup.head)

print(soup.title)


<head><title>The Dormouse's story</title></head>

<title>The Dormouse's story</title>


Accessing a tag by name only returns the first matching Tag. To get all tags of a certain kind, use the find_all method:


soup.find_all('a')


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
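As a quick, self-contained check of this behavior, here is a minimal sketch; it uses the built-in html.parser and a trimmed-down version of the sample document, so the exact markup is illustrative:

```python
from bs4 import BeautifulSoup

# A trimmed-down version of the article's sample document
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# soup.a gives only the first <a>; find_all returns every match
print(soup.a["id"])   # link1
links = soup.find_all("a")
print(len(links))     # 2

# find_all also accepts attribute filters
print(soup.find_all("a", id="link2")[0].get_text())  # Lacie
```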


2. contents: returns the Tag's children as a list


head_tag = soup.head

head_tag.contents


[<title>The Dormouse's story</title>]


title_tag = head_tag.contents[0]

title_tag


<title>The Dormouse's story</title>


title_tag.contents


["The Dormouse's story"]
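One practical difference worth noting: .contents is a real list (indexable and reusable), while .children (covered next) is a one-shot iterator. A minimal sketch, using the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# .contents is a plain list, so it can be indexed and reused
print(type(head_tag.contents).__name__)   # list
print(head_tag.contents[0].name)          # title

# .children is an iterator: it is exhausted after one pass
kids = head_tag.children
print([child.name for child in kids])     # ['title']
print(list(kids))                         # [] (already consumed)
```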


3. children: an iterator for looping over the direct children


for child in title_tag.children:

  print(child)


The Dormouse's story


4. descendants: contents and children both return only direct children, while descendants recursively iterates over all descendants of a tag


for child in head_tag.children:

  print(child)


<title>The Dormouse's story</title>


for child in head_tag.descendants:

  print(child)


<title>The Dormouse's story</title>

The Dormouse's story
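The difference is easy to see by counting: .children yields only the <title> tag, while .descendants also yields the string inside it. A minimal sketch using the built-in html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<head><title>The Dormouse's story</title></head>", "html.parser")
head_tag = soup.head

# direct children: just the <title> tag
print(len(list(head_tag.children)))      # 1
# descendants: the <title> tag plus its NavigableString child
print(len(list(head_tag.descendants)))   # 2
```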


5. string: if a tag has only one child of type NavigableString, the tag's .string returns that child


title_tag.string


"The Dormouse's story"


If a tag has only a single child tag, then .string returns that child's NavigableString as well:


head_tag.string


"The Dormouse's story"


If a tag has multiple children, .string cannot tell which child's content it should refer to, so it returns None:

print(soup.html.string)


None
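When .string comes back None, a common alternative is get_text(), which concatenates every string in the subtree. A minimal sketch (the markup here is ad hoc):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>one</b> two</p>", "html.parser")

# <p> has two children (<b> and a string), so .string is ambiguous -> None
print(soup.p.string)      # None
# get_text() joins all strings in the subtree instead
print(soup.p.get_text())  # one two
```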


6. strings and stripped_strings


If a tag contains more than one string, you can iterate over them with .strings:


for string in soup.strings:

  print(string)


The Dormouse's story

 

The Dormouse's story

  

Once upon a time there were three little sisters; and their names were

 

Elsie

,


Lacie

 and

 

Tillie

;

and they lived at the bottom of a well.

 

...

The output of .strings contains a lot of whitespace and blank lines; use .stripped_strings to strip out that whitespace:


for string in soup.stripped_strings:

  print(string)


The Dormouse's story

The Dormouse's story

Once upon a time there were three little sisters; and their names were

Elsie

,

Lacie

and

Tillie

;

and they lived at the bottom of a well.

...
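The same contrast in miniature, with an ad-hoc two-paragraph document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  one </p><p>\n two </p>", "html.parser")

# .strings keeps the surrounding whitespace exactly as parsed
print(list(soup.strings))           # ['  one ', '\n two ']
# .stripped_strings trims it and skips whitespace-only strings
print(list(soup.stripped_strings))  # ['one', 'two']
```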


II. Parent Nodes


1. parent: gets an element's parent node


title_tag = soup.title

title_tag.parent


<head><title>The Dormouse's story</title></head>


A string also has a parent:


title_tag.string.parent


<title>The Dormouse's story</title>


2. parents: recursively iterates over all ancestor nodes


link = soup.a

for parent in link.parents:

  if parent is None:

    print(parent)

  else:

    print(parent.name)


body

html

[document]
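The loop above can be condensed into a list comprehension, which is handy for building the "path" of a tag. A self-contained sketch with the built-in html.parser and ad-hoc markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>", "html.parser")

# collect the name of every ancestor, from the closest outward;
# the BeautifulSoup object itself has the name '[document]'
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```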


III. Sibling Nodes


sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></b></a>",'lxml')

print(sibling_soup.prettify())


<html>

 <body>

 <a>

  <b>

  text1

  </b>

  <c>

  text2

  </c>

 </a>

 </body>

</html>


1. next_sibling and previous_sibling


sibling_soup.b.next_sibling


<c>text2</c>


sibling_soup.c.previous_sibling


<b>text1</b>


In a real document, .next_sibling and .previous_sibling are usually strings or whitespace:


soup.find_all('a')


[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,

 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,

 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


soup.a.next_sibling # the first <a>'s next_sibling is the string ',\n'


',\n'

soup.a.next_sibling.next_sibling


<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
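Chaining .next_sibling twice to skip the whitespace string is fragile; find_next_sibling('a') jumps straight to the next tag of interest. A minimal sketch with the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = '<p><a class="sister" id="link1">Elsie</a>,\n<a class="sister" id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# .next_sibling lands on the ",\n" string first
print(repr(soup.a.next_sibling))            # ',\n'
# find_next_sibling("a") skips over it to the next <a> tag
print(soup.a.find_next_sibling("a")["id"])  # link2
```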


2. next_siblings and previous_siblings


for sibling in soup.a.next_siblings:

  print(repr(sibling))


',\n'

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

' and\n'

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

';\nand they lived at the bottom of a well.'


for sibling in soup.find(id="link3").previous_siblings:

  print(repr(sibling))


' and\n'

<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>

',\n'

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

'Once upon a time there were three little sisters; and their names were\n'


IV. Going Backward and Forward

1. next_element and previous_element


These point to the next or previous parsed object (a string or a tag), that is, the successor and predecessor in a depth-first traversal:


last_a_tag = soup.find("a", id="link3")

print(last_a_tag.next_sibling)

print(last_a_tag.next_element)

;
and they lived at the bottom of a well.
Tillie

last_a_tag.previous_element

' and\n'

2. next_elements and previous_elements


With .next_elements and .previous_elements you can move forward or backward through the document's parsed content, as if the document were being parsed all over again:


for element in last_a_tag.next_elements:

  print(repr(element))


'Tillie'

';\nand they lived at the bottom of a well.'

'\n'

<p class="story">...</p>

'...'

'\n'
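The related find_all_next / find_all_previous methods filter this element stream by tag name, which is often more convenient than iterating manually. A small self-contained sketch with ad-hoc markup and the built-in html.parser:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p><p>end</p>'
soup = BeautifulSoup(html, "html.parser")
a_tag = soup.find(id="link2")

# next_element first descends INTO the tag: it is the tag's own text
print(repr(a_tag.next_element))                          # 'Lacie'
# find_all_next("p") keeps only later <p> tags from the element stream
print([p.get_text() for p in a_tag.find_all_next("p")])  # ['end']
```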



Copyright notice: this article is an original post by the CSDN blogger "python进步学习者", released under the CC 4.0 BY-SA license; reproduction must include a link to the original article and this notice.
