需要登录时使用 rvest
我在报废网页时遇到的常见问题是如何输入用户 ID 和密码登录网站。
在这个我创建的示例中,用于跟踪我在此处发布的堆栈溢出的答案。整体流程是登录,转到网页收集信息,添加数据帧然后转到下一页。
library(rvest)
#Address of the login webpage
login<-"https://stackoverflow.com/users/login?ssrc=head&returnurl=http%3a%2f%2fstackoverflow.com%2f"
#create a web session with the desired login address
pgsession<-html_session(login)
pgform<-html_form(pgsession)[[2]] #in this case the submit is the 2nd form
filled_form<-set_values(pgform, email="*****", password="*****")
submit_form(pgsession, filled_form)
#pre allocate the final results dataframe.
results<-data.frame()
#loop through all of the pages with the desired info
for (i in 1:5)
{
#base address of the pages to extract information from
url<-"http://stackoverflow.com/users/**********?tab=answers&sort=activity&page="
url<-paste0(url, i)
page<-jump_to(pgsession, url)
#collect info on the question votes and question title
summary<-html_nodes(page, "div .answer-summary")
question<-matrix(html_text(html_nodes(summary, "div"), trim=TRUE), ncol=2, byrow = TRUE)
#find date answered, hyperlink and whether it was accepted
dateans<-html_node(summary, "span") %>% html_attr("title")
hyperlink<-html_node(summary, "div a") %>% html_attr("href")
accepted<-html_node(summary, "div") %>% html_attr("class")
#create temp results then bind to final results
rtemp<-cbind(question, dateans, accepted, hyperlink)
results<-rbind(results, rtemp)
}
#Dataframe Clean-up
names(results)<-c("Votes", "Answer", "Date", "Accepted", "HyperLink")
results$Votes<-as.integer(as.character(results$Votes))
results$Accepted<-ifelse(results$Accepted=="answer-votes default", 0, 1)
在这种情况下,循环仅限于 5 页,这需要更改以适合你的应用程序。我用******替换了用户特定的值,希望这将为你提供一些指导。