Basic scraping with rvest
rvest is Hadley Wickham's package for web scraping and parsing, inspired by Python's Beautiful Soup. It relies on the libxml2 bindings in Hadley's xml2 package for HTML parsing.
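As part of the tidyverse, rvest is designed to be used with pipes. It uses xml2::read_html to fetch and parse the HTML of a web page. The result can then be subset with the html_node and html_nodes functions, using CSS or XPath selectors, and parsed into R objects with functions such as html_text and html_table.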
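The difference between html_node and html_nodes, and between CSS and XPath selectors, is easier to see on a small document. The following is a minimal sketch that parses an inline HTML string rather than a live page; the markup, class names, and variable names are made up for illustration.

```r
library(rvest)

# A tiny inline HTML document (hypothetical content, for illustration only)
page <- read_html('<html><body>
  <p class="intro">Hello</p>
  <p>World</p>
  <table class="demo">
    <tr><th>x</th><th>y</th></tr>
    <tr><td>1</td><td>a</td></tr>
    <tr><td>2</td><td>b</td></tr>
  </table>
</body></html>')

# html_node() returns the first match; here selected with a CSS selector
intro <- page %>% html_node(css = "p.intro") %>% html_text()

# html_nodes() returns all matches; here selected with an XPath expression
paragraphs <- page %>% html_nodes(xpath = "//p") %>% html_text()

# html_table() parses a <table> node into a data.frame
demo <- page %>% html_node(css = "table.demo") %>% html_table()

print(intro)
print(paragraphs)
print(demo)
```

Passing a literal HTML string to read_html keeps the sketch independent of any website, so the selectors can be tried out without network access.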
To scrape the table of milestones from the Wikipedia page on R, the code would look like
library(rvest)

url <- 'https://en.wikipedia.org/wiki/R_(programming_language)'

# scrape HTML from website
url %>% read_html() %>%
    # select HTML tag with class="wikitable"
    html_node(css = '.wikitable') %>%
    # parse table into data.frame
    html_table() %>%
    # trim for printing
    dplyr::mutate(Description = substr(Description, 1, 70))
## Release Date Description
## 1 0.16 This is the last alpha version developed primarily by Ihaka
## 2 0.49 1997-04-23 This is the oldest source release which is currently availab
## 3 0.60 1997-12-05 R becomes an official part of the GNU Project. The code is h
## 4 0.65.1 1999-10-07 First versions of update.packages and install.packages funct
## 5 1.0 2000-02-29 Considered by its developers stable enough for production us
## 6 1.4 2001-12-19 S4 methods are introduced and the first version for Mac OS X
## 7 2.0 2004-10-04 Introduced lazy loading, which enables fast loading of data
## 8 2.1 2005-04-18 Support for UTF-8 encoding, and the beginnings of internatio
## 9 2.11 2010-04-22 Support for Windows 64 bit systems.
## 10 2.13 2011-04-14 Adding a new compiler function that allows speeding up funct
## 11 2.14 2011-10-31 Added mandatory namespaces for packages. Added a new paralle
## 12 2.15 2012-03-30 New load balancing functions. Improved serialization speed f
## 13 3.0 2013-04-03 Support for numeric index values 231 and larger on 64 bit sy
While this returns a data.frame, note that, as is typical for scraped data, further cleaning is still needed: here, formatting the dates, inserting NAs, and so on.
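As a rough sketch of that cleanup, the snippet below rebuilds the first three rows of the output by hand instead of scraping them again. It assumes that the missing date arrives as an empty string and that the columns are named Release and Date; the real scraped table may differ on both points.

```r
# A small stand-in for the scraped table (first rows of the output above).
# Assumption: the missing date is an empty string, columns are Release and Date.
milestones <- data.frame(
  Release = c("0.16", "0.49", "0.60"),
  Date = c("", "1997-04-23", "1997-12-05"),
  stringsAsFactors = FALSE
)

# Turn empty strings into explicit missing values
milestones$Date[milestones$Date == ""] <- NA

# Parse the year-month-day strings into Date objects; NA stays NA
milestones$Date <- as.Date(milestones$Date, format = "%Y-%m-%d")

str(milestones)
```

Release is kept as character here because values such as 0.65.1 cannot be represented as numbers.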
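Note that data in a less consistently rectangular format may require looping or other further munging to parse successfully. If the website uses jQuery or other means to insert content dynamically, read_html may not be sufficient for scraping, and a more robust scraper such as RSelenium may be necessary.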
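For the non-rectangular case, one possible approach is to loop over the repeating container nodes and extract each field separately, filling in NA where a field is absent. The sketch below uses an invented inline document; the class names, the first_text helper, and the column names are all hypothetical.

```r
library(rvest)

# Hypothetical markup: the second item has no price
page <- read_html('<html><body>
  <div class="item"><span class="name">A</span><span class="price">1.50</span></div>
  <div class="item"><span class="name">B</span></div>
</body></html>')

items <- html_nodes(page, css = "div.item")

# Hypothetical helper: text of the first match inside a node, or NA if none
first_text <- function(node, css) {
  found <- html_nodes(node, css = css)
  if (length(found) == 0) NA_character_ else html_text(found[[1]])
}

# Build one single-row data.frame per item, then stack the rows
rows <- lapply(items, function(item) {
  data.frame(
    name = first_text(item, "span.name"),
    price = as.numeric(first_text(item, "span.price")),
    stringsAsFactors = FALSE
  )
})
result <- do.call(rbind, rows)

print(result)
```

Extracting field by field keeps the rows aligned even when an element is missing, which selecting all names and all prices in two separate html_nodes calls would not guarantee.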