Basic scraping with rvest

Created: November-22, 2018

rvest is a web scraping and parsing package by Hadley Wickham, inspired by Python's Beautiful Soup. It relies on the libxml2 bindings in Hadley's xml2 package for HTML parsing.

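As part of the tidyverse, rvest is designed to be used with the pipe operator (%>%). It uses

xml2::read_html to scrape the HTML of a webpage,
which can then be subset with its html_node and html_nodes functions using CSS or XPath selectors, and
parsed to R objects using functions like html_text and html_table.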
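The three steps can be tried without network access on a small inline document. This is a minimal sketch: the HTML content and the variable names are made up for illustration, and it relies on read_html accepting a literal HTML string.

```r
library(rvest)

# a small inline HTML document (hypothetical content), so no network access is needed
page <- read_html('
  <html><body>
    <h1 id="title">Fruit prices</h1>
    <ul>
      <li class="fruit">apple</li>
      <li class="fruit">banana</li>
      <li class="other">carrot</li>
    </ul>
  </body></html>')

# html_node() returns the first match; html_text() extracts its text
title <- page %>% html_node(css = "#title") %>% html_text()

# html_nodes() returns every match; here selected with a CSS class selector
fruits_css <- page %>% html_nodes(css = "li.fruit") %>% html_text()

# the same selection expressed as an XPath query
fruits_xpath <- page %>% html_nodes(xpath = "//li[@class='fruit']") %>% html_text()
```

Both selector styles pick out the same two list items here. In rvest 1.0.0 and later, html_element and html_elements are the preferred names for html_node and html_nodes, although the older names still work.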

To scrape the milestones table from the Wikipedia page on R, the code would look like

library(rvest)

url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'

# scrape HTML from website
url %>% read_html() %>% 
    # select HTML tag with class="wikitable"
    html_node(css = '.wikitable') %>% 
    # parse table into data.frame
    html_table() %>%
    # trim for printing
    dplyr::mutate(Description = substr(Description, 1, 70))

##    Release       Date                                                  Description
## 1     0.16            This is the last alpha version developed primarily by Ihaka 
## 2     0.49 1997-04-23 This is the oldest source release which is currently availab
## 3     0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h
## 4   0.65.1 1999-10-07 First versions of update.packages and install.packages funct
## 5      1.0 2000-02-29 Considered by its developers stable enough for production us
## 6      1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X
## 7      2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data 
## 8      2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio
## 9     2.11 2010-04-22                          Support for Windows 64 bit systems.
## 10    2.13 2011-04-14 Adding a new compiler function that allows speeding up funct
## 11    2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle
## 12    2.15 2012-03-30 New load balancing functions. Improved serialization speed f
## 13     3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy

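While this returns a data.frame, note that, as is typical for scraped data, there is still further cleaning to be done: here, formatting the dates, inserting NAs, and so on.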
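A cleaning step of that kind might look like the following sketch. The releases data.frame is a hand-typed stand-in for the first rows of the scraped table, and it assumes the blank date was read in as an empty string; the real result may differ as the Wikipedia page changes.

```r
# hypothetical stand-in for the first rows of the scraped table
releases <- data.frame(
  Release = c("0.16", "0.49", "0.60"),
  Date = c("", "1997-04-23", "1997-12-05"),
  stringsAsFactors = FALSE
)

# turn empty strings into explicit missing values
releases$Date[releases$Date == ""] <- NA

# parse the ISO-formatted strings into Date objects; NA stays NA
releases$Date <- as.Date(releases$Date, format = "%Y-%m-%d")

str(releases)
```

Release is left as character on purpose, since version labels such as 0.65.1 are not numbers.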

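Note that data in a less consistently rectangular format may take looping or other further munging to parse successfully. If the website uses jQuery or other means to insert content after the page loads, read_html may be insufficient for scraping, and a more robust scraper such as RSelenium may be necessary.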
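As one illustration of such munging, the sketch below handles records where a field is sometimes absent. Instead of an explicit loop it applies html_node to a whole node set, which yields one result per record and a missing value where nothing matches. The HTML snippet, class names, and the products data.frame are invented for the example.

```r
library(rvest)

# hypothetical listing in which the second item has no price
page <- read_html('
  <div class="item"><span class="name">A</span><span class="price">1.50</span></div>
  <div class="item"><span class="name">B</span></div>
  <div class="item"><span class="name">C</span><span class="price">3.00</span></div>')

# one node per record
items <- html_nodes(page, css = "div.item")

# html_node() on a node set returns one result per item,
# with a missing node where nothing matches, so the columns stay aligned
products <- data.frame(
  name  = items %>% html_node(css = ".name")  %>% html_text(),
  price = items %>% html_node(css = ".price") %>% html_text() %>% as.numeric(),
  stringsAsFactors = FALSE
)
```

Selecting ".price" directly from the page with html_nodes would return only two values and lose track of which item they belong to, which is why the selection is done per record.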