Basic scraping with rvest

Created: November-22, 2018

rvest is Hadley Wickham's package for web scraping and parsing, inspired by Python's Beautiful Soup. It parses HTML through the libxml2 bindings provided by Hadley's xml2 package.

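As part of the tidyverse, rvest is designed to be used with the pipe. It uses

xml2::read_html to scrape the HTML of a web page,
which can then be subset with its html_node and html_nodes functions using CSS or XPath selectors, and
parsed into R objects with functions such as html_text and html_table.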
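The three steps above can be sketched on a small inline document, so that nothing needs to be downloaded. The HTML content, the `#title` id, and the `version` class are made up for illustration; only the rvest functions come from the text.

```r
library(rvest)

# A small inline document stands in for a downloaded page (hypothetical content)
page <- read_html('
  <html><body>
    <h1 id="title">Releases</h1>
    <ul>
      <li class="version">0.49</li>
      <li class="version">1.0</li>
    </ul>
  </body></html>')

# html_node() returns the first match; here it is selected with a CSS selector
title <- page %>% html_node(css = '#title') %>% html_text()

# html_nodes() returns every match; here they are selected with an XPath expression
versions <- page %>% html_nodes(xpath = '//li[@class="version"]') %>% html_text()
```

Here `title` holds the heading text and `versions` is a character vector with one element per matching list item.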

To scrape the table of milestones from the Wikipedia page on R, the code would look like

library(rvest)

url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'

# scrape HTML from website
url %>% read_html() %>% 
    # select HTML tag with class="wikitable"
    html_node(css = '.wikitable') %>% 
    # parse table into data.frame
    html_table() %>%
    # trim for printing
    dplyr::mutate(Description = substr(Description, 1, 70))

##    Release       Date                                                  Description
## 1     0.16            This is the last alpha version developed primarily by Ihaka 
## 2     0.49 1997-04-23 This is the oldest source release which is currently availab
## 3     0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h
## 4   0.65.1 1999-10-07 First versions of update.packages and install.packages funct
## 5      1.0 2000-02-29 Considered by its developers stable enough for production us
## 6      1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X
## 7      2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data 
## 8      2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio
## 9     2.11 2010-04-22                          Support for Windows 64 bit systems.
## 10    2.13 2011-04-14 Adding a new compiler function that allows speeding up funct
## 11    2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle
## 12    2.15 2012-03-30 New load balancing functions. Improved serialization speed f
## 13     3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy

While this returns a data.frame, note that, as is typical for scraped data, further cleaning is still needed: here, formatting the dates, inserting NAs, and so on.
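A minimal sketch of that kind of cleanup, using base R only. The data frame below rebuilds the first three rows of the printed result by hand, and the name `milestones` is hypothetical. The column types in an actual scrape may differ with the rvest version, so treat this as an illustration rather than a fixed recipe.

```r
# The first three rows of the printed result, rebuilt by hand for illustration
milestones <- data.frame(
  Release = c("0.16", "0.49", "0.60"),
  Date = c("", "1997-04-23", "1997-12-05"),
  stringsAsFactors = FALSE
)

# Replace empty strings with NA, then parse the ISO-formatted dates
milestones$Date[milestones$Date == ""] <- NA
milestones$Date <- as.Date(milestones$Date, format = "%Y-%m-%d")
```

After this step the Date column has class Date, and the row with no release date carries NA instead of an empty string.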

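Note that data in a less consistently rectangular format may need looping or other additional munging to parse successfully. If the website uses jQuery or other means to insert content after the page loads, read_html may not be enough to scrape it, and a more robust scraper such as RSelenium may be necessary.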
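As a rough sketch of the looping case, the example below walks over a set of items in which one field is sometimes absent and fills the gap with NA. The markup, the class names `item`, `name`, and `date`, and the variable names are all hypothetical.

```r
library(rvest)

# Hypothetical markup: the second item has no date element
page <- read_html('
  <div class="item"><span class="name">alpha</span><span class="date">2001-12-19</span></div>
  <div class="item"><span class="name">beta</span></div>')

items <- page %>% html_nodes('.item')

# Loop over the items; html_node() yields a missing node when nothing matches,
# and html_text() turns that missing node into NA
rows <- lapply(items, function(item) {
  data.frame(
    name = item %>% html_node('.name') %>% html_text(),
    date = item %>% html_node('.date') %>% html_text(),
    stringsAsFactors = FALSE
  )
})

# Stack the one-row data frames into a single rectangular result
records <- do.call(rbind, rows)
```

Extracting each field per item, rather than collecting all names and all dates separately, keeps the values aligned when an item lacks one of them.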