r - Which selector should I write in rvest in R to extract information from Google Websearch?

Question

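I am trying to download the <h3 class="r"> content of a Google Websearch results page, as in the image below.

I tried to write such a selector in R with the rvest package, but it returned no results. Does anyone know what the selector should look like?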

> library(rvest)
> 
> googleContent <- html("https://www.google.pl/#q=wiadomosci") %>% 
+    html_nodes( "h3[class=r]" )
> googleContent
list()
attr(,"class")
[1] "XMLNodeSet"
> googleContent <- html("https://www.google.pl/#q=wiadomosci") %>% 
+    html_nodes( "h3.r" )
> googleContent
list()
attr(,"class")
[1] "XMLNodeSet"
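
One likely reason both attempts return an empty node set is the URL rather than the selector: everything after "#" is a URL fragment, which HTTP clients do not send to the server, so the request fetches the bare Google home page. The sketch below illustrates this, then checks the "h3.r a" selector against a small hypothetical HTML snippet, not real Google output. It assumes a version of rvest that provides read_html() (the replacement for the deprecated html()); the sample markup and the variable names are illustrative only, and Google's real markup may differ or change.

```r
library(rvest)

# Everything after "#" is a fragment and never reaches the server,
# so this URL is requested as the plain home page.
fragment_url <- "https://www.google.pl/#q=wiadomosci"
requested    <- sub("#.*$", "", fragment_url)

# Putting the query into the query string sends it to the server.
search_url <- paste0(
  "https://www.google.pl/search?q=",
  utils::URLencode("wiadomosci", reserved = TRUE)
)

# Hypothetical markup that mimics one <h3 class="r"> result heading.
sample_html <- paste0(
  '<div><h3 class="r">',
  '<a href="/url?q=http://example.com/&amp;sa=U">Example title</a>',
  '</h3></div>'
)

# The class selector from the question, narrowed to the link inside the heading.
nodes  <- read_html(sample_html) %>% html_nodes("h3.r a")
titles <- html_text(nodes)
hrefs  <- html_attr(nodes, "href")
```

With a live page, read_html(search_url) would take the place of read_html(sample_html), provided the server returns the same markup to a script as it does to a browser.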

I also tried other packages, but I don't like cumbersome code... (code adapted from this post)

> # load packages
> library(RCurl)
> library(XML)
> library(dplyr)
> get_google_page_urls <- function(u) {
+    # read in page contents
+    html <- getURL(u)
+    
+    # parse HTML into tree structure
+    doc <- htmlParse(html)
+    
+    # extract url nodes using XPath. Originally I had used "//a[@href][@class='l']" until the google code change.
+    links <- xpathApply(doc, "//h3//a[@href]", function(x) xmlAttrs(x)[[1]])
+    
+    # free doc from memory
+    free(doc)
+    
+    # ensure urls start with "http" to avoid google references to the search page
+    links <- grep("http://", links, fixed = TRUE, value=TRUE)
+    return(links)
+ }
> 
> u <- "http://www.google.pl/search?aq=f&gcx=w&sourceid=chrome&ie=UTF-8&q=wiadomosci"
>  get_google_page_urls(u) %>% grep( pattern = "/url", value = TRUE) %>% strsplit( "?q=") %>%
+    lapply( function(element){ strsplit( element[2], ".pl" )[[1]][1] } ) %>%
+    unlist() %>% paste0(".pl") %>% unique()
[1] "http://wiadomosci.onet.pl"   "http://www.tvn24.pl"         "http://tvnwarszawa.tvn24.pl"
[4] "http://wiadomosci.wp.pl"     "http://warszawa.gazeta.pl"   "http://wiadomosci.gazeta.pl"
[7] "http://wiadomosci.tvp.pl"    "http://www.se.pl"
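
The clean-up step above splits on "?q=" and ".pl", which only works for .pl domains. A possibly less fragile sketch in base R is shown below. It assumes the hrefs have the form "/url?q=<target>&..." with q as the first parameter, as the output above suggests; the sample hrefs and the helper name extract_target are hypothetical.

```r
# Hypothetical hrefs of the kind the "//h3//a[@href]" XPath might return.
hrefs <- c(
  "/url?q=http://wiadomosci.onet.pl/&sa=U&ei=abc",
  "/url?q=http://www.tvn24.pl/&sa=U&ei=def",
  "/search?q=wiadomosci&tbm=nws"
)

extract_target <- function(href) {
  # Keep only the redirect links that wrap an external result.
  redirects <- href[grepl("^/url\\?q=", href)]
  # Capture the value of q up to the next "&".
  q <- sub("^/url\\?q=([^&]*).*$", "\\1", redirects)
  # Undo percent-encoding one element at a time.
  vapply(q, utils::URLdecode, character(1), USE.NAMES = FALSE)
}

targets <- extract_target(hrefs)

# Reduce each target to scheme plus host, whatever the top-level domain.
sites <- unique(sub("^(https?://[^/]+).*$", "\\1", targets))
```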

Could this help somehow? The documentation is very poor, so I can't figure out how this function works.

search <- html_form(html("https://www.google.com"))[[1]]


set_values(search, q = "My little pony")
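
One detail about the snippet above: the form helpers return a modified copy instead of changing the form in place, so the result has to be assigned. The sketch below shows this offline on a hypothetical stand-in for the search form. It assumes rvest 1.0 or later, where, as far as I know, set_values() was renamed html_form_set(). The form markup and the variable names page and filled are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in for the search form, so the sketch needs no network access.
page <- read_html(
  '<form action="/search" method="get"><input type="text" name="q" value=""></form>'
)

# base_url is supplied because a document parsed from a string has no URL of its own.
search <- html_form(page, base_url = "https://www.google.com")[[1]]

# html_form_set() returns a new form object; "search" itself is left unchanged.
filled <- html_form_set(search, q = "My little pony")
```

Sending the filled form would be a separate step, for example with session_submit() on a session() object in the same rvest versions.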
