html - How can I extract URLs containing dashes from HTML using R?

Question

I have some HTML like this:

<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>
<li><a href="http://website.com/index.html" target="_blank">Website</a></li>
<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>
<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li>
<li><a href="http://www.another-site.com/">Another Site</a></li>

Using

m<-regexpr("http://\\S*/?", links, perl=T)
links<-regmatches(links, m)

I get the links, but the ones containing dashes are truncated, like this:

http://www.website.com/index.aspx
http://website.com/index.html
http://www.website
http://website2.org/index.htm
http://www.another-site.com/

I thought \S matched anything that is not whitespace. What is going on?

score 4 · Accepted Answer

Use XML::getHTMLLinks.

For example:

library(XML)
# assuming your HTML document is 'foo.html'

getHTMLLinks(doc = 'foo.html')
# [1] "http://www.website.com/index.aspx"  "http://website.com/index.html"      "http://www.website-with-dashes.org"
# [4] "http://website2.org/index.htm"      "http://www.another-site.com/"
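
If the HTML is already in an R character string rather than a file, the same idea can be sketched with htmlParse and an XPath query. This is a minimal sketch, not a definitive implementation: the variable names html, doc and links are illustrative, and the input is assumed to be the snippet from the question with the closing </ul> added.

```r
library(XML)

# The HTML from the question, held in one string (assumed input;
# the closing </ul> was added here for completeness).
html <- paste(
  '<ul><li><a href="http://www.website.com/index.aspx" target="_blank">Website</a></li>',
  '<li><a href="http://website.com/index.html" target="_blank">Website</a></li>',
  '<li><a href="http://www.website-with-dashes.org" target="_blank">Website With Dashes</a></li>',
  '<li><a href="http://website2.org/index.htm" target="_blank">Website 2</a></li>',
  '<li><a href="http://www.another-site.com/">Another Site</a></li></ul>',
  sep = "\n"
)

# Parse the string as HTML; asText = TRUE marks the input as content,
# not a file name or URL.
doc <- htmlParse(html, asText = TRUE)

# Visit every <a> element that has an href attribute and read its value.
links <- xpathSApply(doc, "//a[@href]", xmlGetAttr, "href")
print(links)
```

Because the parser reads the attribute value itself, dashes and other punctuation in the URL need no special handling.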

Parsing HTML with regular expressions is not always straightforward. https://stackoverflow.com/a/1732454/1385941 is an interesting read.
