Beautiful_Soup Chinese Documentation PDF Download - Java知识分享网 (free Java resource downloads)



Download compiled by this site:

Link: https://pan.baidu.com/s/1ujneC0iNYB2M1xQHzNfs7Q

Extraction code: y651


Main content:

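from BeautifulSoup import BeautifulSoup
import re

# Two paragraphs, each with an id, an align attribute, and a <b> element;
# the searches further down rely on these attributes.
doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
soup = BeautifulSoup(''.join(doc))

print soup.prettify()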

# <html>
#  <head>
#   <title>
#    Page title
#   </title>
#  </head>
#  <body>
#   <p id="firstpara" align="center">
#    This is paragraph
#    <b>
#     one
#    </b>
#    .
#   </p>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

Some ways to navigate the soup:

soup.contents[0].name
# u'html'

soup.contents[0].contents[0].name
# u'head'

head = soup.contents[0].contents[0]
head.parent.name
# u'html'

head.next
# <title>Page title</title>

head.nextSibling.name
# u'body'

head.nextSibling.contents[0]
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

head.nextSibling.contents[0].nextSibling
# <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>

Here are some ways to search the soup for particular tags, or for tags with particular attributes:

titleTag = soup.html.head.title
titleTag
# <title>Page title</title>

titleTag.string
# u'Page title'

len(soup('p'))
# 2

soup.findAll('p', align="center")
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>]

soup.find('p', align="center")
# <p id="firstpara" align="center">This is paragraph <b>one</b>.</p>

soup('p', align="center")[0]['id']
# u'firstpara'

soup.find('p', align=re.compile('^b.*'))['id']
# u'secondpara'

soup.find('p').b.string
# u'one'

soup('p')[1].b.string
# u'two'
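The `align=re.compile('^b.*')` filter above selects a tag by matching a regular expression against an attribute value. As a minimal sketch using only the standard `re` module, and assuming the two paragraphs carry the `align` values `center` and `blah`, the following shows why only the second paragraph qualifies. The `attrs_by_id` dictionary and the use of `pattern.search` are illustrative stand-ins, not Beautiful Soup internals.

```python
import re

# Hypothetical stand-in for the two <p> tags: id -> align value (assumed values).
attrs_by_id = {"firstpara": "center", "secondpara": "blah"}

pattern = re.compile('^b.*')  # same pattern as in the search above

# Keep the ids whose align value the pattern matches.
matching_ids = [pid for pid, align in sorted(attrs_by_id.items())
                if pattern.search(align)]
print(matching_ids)
```

Because the pattern is anchored with `^`, only values that begin with `b` match, so `center` is rejected.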

Modifying the soup is also easy:

titleTag['id'] = 'theTitle'
titleTag.contents[0].replaceWith("New title")
soup.html.head
# <head><title id="theTitle">New title</title></head>

soup.p.extract()
print soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <p id="secondpara" align="blah">
#    This is paragraph
#    <b>
#     two
#    </b>
#    .
#   </p>
#  </body>
# </html>

(Source: Beautiful Soup documentation, page 3, http://www.crummy.com/software/BeautifulSoup/documentation.zh.html, 8/12/2010 3:58:02 PM)

soup.p.replaceWith(soup.b)
print soup.prettify()
# <html>
#  <head>
#   <title id="theTitle">
#    New title
#   </title>
#  </head>
#  <body>
#   <b>
#    two
#   </b>
#  </body>
# </html>

soup.body.insert(0, "This page used to have ")
soup.body.insert(2, " &lt;p&gt; tags!")
soup.body
# <body>This page used to have <b>two</b> &lt;p&gt; tags!</body>
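The positions passed to `insert` count the children of the tag. A rough analogy with a plain Python list may help, assuming the body holds a single `<b>` element before the two calls. The `children` list and its string stand-ins are hypothetical and are not Beautiful Soup objects.

```python
# Plain-list stand-in for soup.body.contents; strings stand in for nodes.
children = ["<b>two</b>"]

children.insert(0, "This page used to have ")  # goes before the <b> element
children.insert(2, " tags!")                   # index 2 is now the end of the list

print("".join(children))
```

After the first call the list has two items, so position 2 in the second call places the new text after the `<b>` element.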

A real-world example: fetch the ICC Commercial Crime Services weekly piracy report page, parse it with Beautiful Soup, and print the piracy incidents it lists:

import urllib2
from BeautifulSoup import BeautifulSoup

page = urllib2.urlopen("http://www.icc-ccs.org/prc/piracyreport.php")
soup = BeautifulSoup(page)
# Each incident sits in a table cell with width="90%".
for incident in soup('td', width="90%"):
    where, linebreak, what = incident.contents[:3]
    print where.strip()
    print what.strip()
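Judging by the variable names, the loop expects each matching cell to begin with a text node, a line-break element, and a second text node. Here is a minimal sketch of that unpacking step, with a plain list standing in for `incident.contents`. The placeholder strings are invented for illustration and are not real report data.

```python
# Hypothetical stand-in for incident.contents: text, line break, text, extras.
contents = ["  Location text \n", "<br />", "\n Description text ", "<br />"]

# Slicing to the first three items lets the unpacking succeed
# even when a cell has more than three children.
where, linebreak, what = contents[:3]

print(where.strip())
print(what.strip())
```

A cell with fewer than three children would still raise `ValueError` at the unpacking line, so a more defensive scraper might check the length first.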
