Using Beautiful Soup in Python to parse HTML, find elements, and extract their content
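- 1. Parsing the document and getting the title
- 2. Getting all elements by tag and id
- 3. Getting all elements by tag and class
- 4. Getting the values of tags under an element
- 5. Getting the values of an element's parent and child elements
- References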
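from bs4 import BeautifulSoup

file_html = 'test/demo.html'
# Read the raw bytes and decode them as UTF-8; the with-block closes the file
with open(file_html, "rb") as file:
    html = file.read().decode("utf-8")
# Parse with Python's built-in html.parser
bs = BeautifulSoup(html, "html.parser")

# 1. Get the page title (this prints the whole <title> tag)
print("Get the page title")
print(bs.title)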
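# 2. Find all <input> elements with id="mSearchInput"
id_list = bs.find_all('input', id='mSearchInput')
# 3. Find all <div> elements with the class "view-num-box"
div_class_list = bs.find_all('div', class_='view-num-box')
for i, div in enumerate(div_class_list):
    # div.parent is the element that directly encloses div
    print(i, div.text, ' parent: ', div.parent.text)
print('-----------------------------------------------------------')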
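
The lookups above depend on the contents of test/demo.html, which is not shown. As a minimal, self-contained sketch of the same calls, the snippet below parses a small inline document instead. The markup, the `card` class, and the variable names are made up for illustration; only `find_all`, the `id` and `class_` keyword filters, and `.parent` come from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for test/demo.html
html = """
<html><head><title>Demo page</title></head>
<body>
  <input id="mSearchInput" type="text" value="bs4">
  <div class="card"><div class="view-num-box">12</div></div>
  <div class="card"><div class="view-num-box extra">34</div></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# .string gives the text inside <title> rather than the tag itself
title_text = soup.title.string

# Filter by tag name plus id
inputs = soup.find_all("input", id="mSearchInput")

# class_ (with a trailing underscore, because class is a Python keyword)
# matches any element whose class list contains the given name
boxes = soup.find_all("div", class_="view-num-box")

print(title_text)
print([tag["value"] for tag in inputs])
print([box.text for box in boxes])
# class is a multi-valued attribute, so it comes back as a list
print([box.parent["class"] for box in boxes])
```

Note that the second box is matched even though it carries an additional class, and that `find_all` returns an empty list rather than raising when nothing matches.
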
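# 4. Read the values of tags nested under each matched element
blog_list = bs.find_all('article', class_='blog-list-box')
for i, blog in enumerate(blog_list):
    print(i, blog.text, '\ntitle: ', bs.find_all('div', class_='blog-list-box-top')[i].text)
    # blog.h4, blog.span and blog.div return the first descendant with that tag name
    print(blog.h4.text)
    print(blog.span.text)
    print(blog.div, blog.div.next_element)
    # 5. .contents is a list of the direct children; .children iterates over the same nodes
    for j, content in enumerate(blog.contents):
        print('contents: ', j, content.text)
    for j, child in enumerate(blog.children):
        print('child: ', j, child.text)
div_list = bs.find_all('div', class_='user-profile-head-address')
print('div_list: ', div_list[0].text)
meta_list = bs.find_all('meta')
for j, meta in enumerate(meta_list):
    # .get() returns None for <meta> tags that have no content attribute (e.g. charset)
    print(j, meta.text, meta.get('content'))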
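
Two details of the loops above are easy to miss: the direct children of a tag include the whitespace text between nested tags, and not every `<meta>` tag has a `content` attribute. The sketch below illustrates both on a small inline document. The markup and variable names are invented for the example and are not taken from test/demo.html.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup; the real demo.html is not shown in this post
html = """
<head><meta charset="utf-8"><meta name="description" content="demo"></head>
<article class="blog-list-box">
  <h4>First post</h4>
  <span>2022-04-01</span>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article", class_="blog-list-box")

# .children yields text nodes (here: line breaks and indentation) as well
# as tags, so keep only the Tag instances when the elements are wanted
tag_children = [child for child in article.children if isinstance(child, Tag)]
child_names = [child.name for child in tag_children]
child_texts = [child.get_text(strip=True) for child in tag_children]

# Tag.get() behaves like dict.get(): a missing attribute gives None
# instead of raising KeyError
meta_contents = [meta.get("content") for meta in soup.find_all("meta")]

print(len(article.contents), len(tag_children))
print(child_names, child_texts)
print(meta_contents)
```

Filtering with `isinstance(child, Tag)` is one possible choice; calling `article.find_all(recursive=False)` is another way to get only the direct child tags.
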
print("2. NavigableString example: get the title's string content and the div's attributes")
print(bs.title.string)
print(bs.div.attrs)
print("3. BeautifulSoup example: get the name of the whole HTML document")
print(bs.name)
print("4. Comment example: get the string of the <a> tag")
print(bs.a.string)
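
The last few prints touch the main object types that Beautiful Soup works with: Tag, NavigableString, BeautifulSoup, and Comment. As a rough sketch of how they relate, the snippet below checks the type of each value on a one-line inline document. The markup is hypothetical and chosen so that the `<a>` tag contains only an HTML comment.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: the <a> tag holds nothing but a comment
html = '<title>Demo</title><div id="box" class="a b"><a href="#"><!-- hidden note --></a></div>'
soup = BeautifulSoup(html, "html.parser")

# Tag: an element such as <title> or <div>
title_tag = soup.title

# NavigableString: the text inside a tag
title_string = soup.title.string

# Attributes are exposed as a dict; class is multi-valued, so it is a list
div_attrs = soup.div.attrs

# Comment: a NavigableString subclass returned when the only child is a comment
a_string = soup.a.string

print(type(title_tag).__name__, type(title_string).__name__)
print(div_attrs)
print(soup.name)  # the BeautifulSoup object itself is named "[document]"
print(type(a_string).__name__, isinstance(a_string, NavigableString))
```

Because a Comment is also a NavigableString, code that only wants visible text may need an explicit `isinstance(value, Comment)` check before using `.string`.
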
References
- https://blog.csdn.net/qq_42732153/article/details/81105725
- https://blog.csdn.net/qq_50587771/article/details/123870433