A BeautifulSoup Hello World scraping example
from bs4 import BeautifulSoup
import requests
main_url = "https://en.wikipedia.org/wiki/Hello_world"
req = requests.get(main_url)
soup = BeautifulSoup(req.text, "html.parser")
# Finding the main title tag.
title = soup.find("h1", class_="firstHeading")
print(title.get_text())
# Finding the mid-titles tags and storing them in a list.
mid_titles = [tag.get_text() for tag in soup.find_all("span", class_="mw-headline")]
# Now using css selectors to retrieve the article shortcut links
links_tags = soup.select("li.toclevel-1")
for tag in links_tags:
    print(tag.a.get("href"))
# Retrieving the side page links by "blocks" and storing them in a dictionary
side_page_blocks = soup.find("div", id="mw-panel").find_all("div", class_="portal")
blocks_links = {}
for num, block in enumerate(side_page_blocks):
    blocks_links[num] = [link.get("href") for link in block.find_all("a", href=True)]
print(blocks_links[0])
Output:
"Hello, World!" program
#Purpose
#History
#Variations
#See_also
#References
#External_links
['/wiki/Main_Page', '/wiki/Portal:Contents', '/wiki/Portal:Featured_content', '/wiki/Portal:Current_events', '/wiki/Special:Random', 'https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en', '//shop.wikimedia.org']
Passing your preferred parser when instantiating BeautifulSoup avoids the usual warning stating that no parser was explicitly specified.
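As a minimal sketch of this point, using a small inline HTML string rather than a live page, naming the parser explicitly keeps the result stable across machines and suppresses the guessed-parser warning:

```python
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p></body></html>"

# Naming "html.parser" explicitly avoids the warning BeautifulSoup emits
# when it has to guess which parser to use, and makes parsing behavior
# consistent whether or not lxml/html5lib happen to be installed.
soup = BeautifulSoup(html, "html.parser")
text = soup.p.get_text()
print(text)  # Hello
```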
Different methods can be used to find an element in the page tree. While a handful of other methods exist, CSS classes and CSS selectors are two convenient ways to locate elements in the tree.
Note that we can also search for tags by setting one of their attribute values to True in the search.
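For instance, the `href=True` filter used in the code above keeps only tags that actually carry that attribute; a small sketch with made-up markup:

```python
from bs4 import BeautifulSoup

html = '<a href="/wiki/Main_Page">home</a> <a name="anchor">no link</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True matches only <a> tags that have an href attribute at all,
# whatever its value; the anchor without one is skipped.
links = [a.get("href") for a in soup.find_all("a", href=True)]
print(links)  # ['/wiki/Main_Page']
```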
get_text() retrieves the text contained in a tag, returned as a single Unicode string. tag.get("attribute") retrieves the value of one of the tag's attributes.
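A short sketch of both calls, on a heading modeled loosely after the one scraped above:

```python
from bs4 import BeautifulSoup

html = '<h1 id="firstHeading">Hello <em>world</em></h1>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.h1

# get_text() flattens the tag and all its children into one string.
text = tag.get_text()
print(text)  # Hello world

# tag.get("attribute") returns the attribute's value, or None if absent,
# much like dict.get().
tag_id = tag.get("id")
missing = tag.get("class")
print(tag_id)   # firstHeading
print(missing)  # None
```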