html - Extracting hrefs from a list in R when some values are missing

Question

I would like to extract the source and demo URLs for a set of Jekyll themes into a data.frame.

library(rvest)

info <- read_html("https://github.com/jekyll/jekyll/wiki/themes")

data <- info %>%
 html_nodes(" #wiki-body li")

data
{xml_nodeset (115)}


[11] <li>Typewriter - (<a href="https://github.com/alixedi/typewriter">source</a>, <a href="http://alixedi.github.io/typewriter">demo</a>)</li>
[12] <li>block-log - (<a href="https://github.com/anandubajith/block-log">source</a>), <a href="https://anandu.net/demo/block-log/">demo</a>)</li>
[13] <li>Otter Pop - (<a href="https://github.com/tybenz/otter-pop">source</a>)</li>

So I need a three-column data.frame (df):

name        source                                       demo
Typewriter   https://github.com/alixedi/typewriter         http://alixedi.github.io/typewriter

I can extract all the hrefs as a single vector, but as item [13] shows, some themes have no demo link, so the vector no longer lines up with the rows.

Is there a simple way to build the df from data? Perhaps using the purrr library?

score 3 · Accepted Answer

Here is your purrr answer:

library(rvest)
library(purrr)
library(dplyr)

info <- read_html("https://github.com/jekyll/jekyll/wiki/themes")

themes <- html_nodes(info, xpath=".//div[@class='markdown-body']/*/li")

zero_to_na <- function(x) { ifelse(length(x)==0, NA, x) }

df <- tibble(name=gsub(" [- ]*\\(.*$", "", html_text(themes)),
             source=map_chr(themes, ~html_attr(html_nodes(., xpath=".//a[contains(., 'source')]"), "href")),
             demo=map_chr(themes, ~zero_to_na(html_attr(html_nodes(., xpath=".//a[contains(., 'demo')]"), "href"))))

glimpse(df)
## Observations: 115
## Variables: 3
## $ name   <chr> "Jalpc", "Pixyll", "Jekyll Metro", "Midnight", "Leap Day", "F...
## $ source <chr> "https://github.com/Jack614/jalpc_jekyll_theme", "https://git...
## $ demo   <chr> "http://www.jack003.com", "http://pixyll.com/", "http://blog-...
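
The step that handles the missing demos is zero_to_na(): looking up an attribute on an empty node set gives back a zero-length vector, which has to become a single NA before it can fill a column. A minimal offline sketch of that idea in base R follows. The list demo_hits is a hypothetical stand-in for the per-item results (its values are taken from the three items quoted in the question), and first_or_na is an illustrative name, not part of the answer's code.

```r
# Stand-in for the per-<li> href lookups: one character vector per item,
# empty (character(0)) when the item has no "demo" link.
demo_hits <- list(
  "http://alixedi.github.io/typewriter",
  "https://anandu.net/demo/block-log/",
  character(0)
)

# Hypothetical typed variant of zero_to_na(): always returns exactly one string,
# using NA_character_ so the resulting column stays character.
first_or_na <- function(x) if (length(x) == 0) NA_character_ else x[[1]]

# vapply() enforces "one character value per item", much like map_chr().
demo <- vapply(demo_hits, first_or_na, character(1))
print(demo)
```

Because every item now yields exactly one value, the result has the same length as the list of items and can be placed next to the name and source columns.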

Alternatively:

map_df(themes, function(x) {
  tibble(name=gsub(" [- ]*\\(.*$", "", html_text(x)),
         source=html_attr(html_nodes(x, xpath=".//a[contains(., 'source')]"), "href"),
         demo=zero_to_na(html_attr(html_nodes(x, xpath=".//a[contains(., 'demo')]"), "href")))
})

Use gsub/sub/etc. to strip the unwanted parts of "name".
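
As a rough illustration of that cleanup, the sketch below applies the same pattern used in the code above to a few item texts. The sample strings are assumptions about what html_text() returns for these items, reconstructed from the HTML quoted in the question.

```r
# Assumed html_text() output for a few list items (reconstructed, not scraped).
raw_names <- c(
  "Typewriter - (source, demo)",
  "block-log - (source), demo)",
  "Otter Pop - (source)",
  "Bitwiser-Material (source, demo)"
)

# Drop everything from the space that precedes the optional " - " and the "(".
# Spaces and hyphens inside the name itself are left alone because the match
# must run straight into an opening parenthesis.
cleaned <- gsub(" [- ]*\\(.*$", "", raw_names)
print(cleaned)
```

The pattern only fires where the run of spaces and hyphens is immediately followed by "(", so names such as "Otter Pop" and "block-log" keep their internal space and hyphen.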

score 2 · Accepted Answer

You can use XPath to collect the items that have a demo link and the items that do not separately, splitting the list into two groups.

withDemo <- info %>%
    html_nodes(xpath = "//li[contains(., 'source') and contains(., 'demo')]")

withoutDemo <- info %>%
    html_nodes(xpath = "//li[contains(., 'source') and not(contains(.,'demo'))]")

Next, create a data frame for the collection that contains both source and demo links.

sourceNdemo <- withDemo %>%
    html_children() %>%              # get all children
    html_attr("href") %>%            # get the href attributes
    matrix(ncol = 2, byrow = TRUE)   # 2 pieces of data for each row

sourceNdemo <- setNames(
    data.frame(html_text(withDemo), sourceNdemo),  # html_text to get "name" column
    c("name", "source", "demo"))
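
The reshaping step relies on the hrefs arriving as one flat vector in source, demo, source, demo order. A small base R sketch of that fold is shown here, using two themes from the question. The names hrefs, pairs and df_pairs are illustrative only, and the approach assumes every item in this group has exactly two child links.

```r
# Flat vector of hrefs as they would come out for two items that each
# have a source link followed by a demo link.
hrefs <- c(
  "https://github.com/alixedi/typewriter", "http://alixedi.github.io/typewriter",
  "https://github.com/anandubajith/block-log", "https://anandu.net/demo/block-log/"
)

# byrow = TRUE fills row by row, so each row becomes one (source, demo) pair.
pairs <- matrix(hrefs, ncol = 2, byrow = TRUE)

# Attach the names and label the columns, mirroring the setNames() call above.
df_pairs <- setNames(
  data.frame(c("Typewriter", "block-log"), pairs, stringsAsFactors = FALSE),
  c("name", "source", "demo")
)
print(df_pairs)
```

If an item in this group had a third link, or listed the demo before the source, the rows would shift or swap columns, so it may be worth checking that the number of hrefs is exactly twice the number of items.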

Next, create a data frame for the items that have only a source link.

source <- withoutDemo %>% 
    html_children() %>%
    html_attr("href")

# set demo = NA for easy rbind-ing
source <- data.frame(name = html_text(withoutDemo), source = source, demo = NA)

Then rbind the two data frames:

allInfo <- rbind(sourceNdemo, source)
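
The rbind works because both data frames carry the same three column names, with demo filled in as NA where there is no link. A minimal sketch with one row per group follows. The variable names are illustrative, and NA_character_ is used here as an assumption to keep the demo column as character.

```r
# One item with both links and one with a source link only.
with_demo <- data.frame(
  name = "Typewriter",
  source = "https://github.com/alixedi/typewriter",
  demo = "http://alixedi.github.io/typewriter",
  stringsAsFactors = FALSE
)
without_demo <- data.frame(
  name = "Otter Pop",
  source = "https://github.com/tybenz/otter-pop",
  demo = NA_character_,   # placeholder so the columns line up
  stringsAsFactors = FALSE
)

# rbind() matches data frame columns by name, so both pieces stack cleanly.
all_info <- rbind(with_demo, without_demo)
print(all_info)
```

Note that stacking the groups this way puts all items with a demo first, so the original page order is not preserved.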

The "name" column now contains entries such as "Jalpc - (source, demo)" and "Bitwiser-Material (source, demo)". You can use sub to strip the trailing "(source, demo)" part.

allInfo$name <- sub("\\s(-\\s)?\\(.+$", "", allInfo$name, perl = TRUE)
