r - Remove URLs from a string

Question

I have a vector of strings (`myStrings`) in R that looks like this:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

where `another url` is a valid http URL, but Stack Overflow does not let me insert more than one URL, so I wrote `another url` instead. I want to remove all the URLs from `myStrings` so that it looks like this:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I have tried many functions in the `stringr` package, but nothing works.

score 18 · Accepted Answer

You can use `gsub` with a regular expression to match the URLs.

Set up a vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"
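One thing worth checking with this pattern is how it handles URLs that carry a path or query string, since it stops at the last period followed by lowercase letters. The following is a minimal sketch using the same pattern; the two input strings are hypothetical and not taken from the original post.

```r
# Same pattern as in the answer above.
pattern <- " ?(f|ht)tp(s?)://(.*)[.][a-z]+"

# Hypothetical inputs (assumptions for illustration): URLs with a path or query.
y <- c(
    "see http://example.com/path/page",
    "docs at https://www.example.net/a/b?id=1 for details"
)

# The match ends at the last ".<lowercase letters>", so the path is left behind.
res <- gsub(pattern, "", y)
print(res)
# [1] "see/path/page"                "docs at/a/b?id=1 for details"
```

So for inputs like these, the pattern removes the scheme and host but leaves the rest of the URL in place.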

Update: It would help if you could post a few different URLs so we know what we are dealing with. But I think this regular expression will work for the URLs mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

The expression above can be explained as follows:

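` ?` an optional space
`(f|ht)` match "f" or "ht"
`tp` match "tp"
`(s?)` optionally match "s" if it is there
`(://)` match "://"
`(.*)` match every character (everything) up to
`[.|/]` a period or a forward slash (inside a character class the `|` is literal, so this also matches a pipe character)
`(.*)` then everything after that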

I'm not an expert with regular expressions, but I think I've explained it correctly.
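As a check on the explanation above: because the final `(.*)` matches everything to the end of the string, any text that follows a URL is removed along with it. The sketch below shows this on one of the sample strings and compares it with a whitespace-bounded pattern. The `bounded` pattern and the `two` string are assumptions for illustration, not part of the original answer.

```r
x <- "go to http://example.com from more info."

# Updated pattern from the answer: the trailing (.*) runs to the end of the string.
greedy <- " ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"
res_greedy <- gsub(greedy, "", x)
print(res_greedy)
# [1] "go to"

# Assumed alternative: treat a URL as everything up to the next whitespace.
bounded <- "[[:space:]]?(f|ht)tps?://[^[:space:]]+"
res_bounded <- gsub(bounded, "", x)
print(res_bounded)
# [1] "go to from more info."

# Hypothetical string with two URLs; each one is removed separately.
two <- "see http://a.example.com/x?y=1 and https://b.example.org"
res_two <- gsub(bounded, "", two)
print(res_two)
# [1] "see and"
```

The whitespace-bounded version has its own trade-off: punctuation that directly follows a URL, such as a closing period, is treated as part of the URL and removed too.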

Note: URL shorteners are no longer allowed in SO answers, so I had to remove a section while making my most recent edit. See the edit history for that part.

score 9 · Accepted Answer

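I've been working on a canned group of regular expressions for common tasks like this, which I've put into a package on GitHub, qdapRegex, that will eventually go to CRAN. It can also extract the pieces as well as sub them out. Any feedback from those who take a look at the package is welcome.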

Here it is:

library(devtools)
install_github("trinker/qdapRegex")
library(qdapRegex)

x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"         

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

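EDIT: I saw that Twitter links were not being removed. I will not be adding this to the regex specific to the `rm_url` function, but I have added it to the dictionary in `qdapRegex`. So there is no specific function to remove both standard URLs and Twitter ones, but `pastex` (paste regular expression) lets you easily grab regexes from the dictionary and paste them together (using the pipe operator, `|`). Since all `rm_XXX` style functions work essentially the same way, you can pass the `pastex` output to the `pattern` argument of any `rm_XXX` function, or create your own function as shown below: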

rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
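For readers without the package installed, the idea of joining dictionary regexes with `|` can be sketched in base R. This is only an illustration of the alternation technique: the `dictionary` vector and both patterns in it are hypothetical stand-ins, not the regexes shipped with qdapRegex, and `paste0` is used here in place of `pastex`.

```r
x <- c(
    "download file from http://example.com",
    "twitter type: t.co/N1kq0F26tG",
    "still another one https://t.co/N1kq0F26tG :-)"
)

# Hypothetical stand-ins for dictionary entries (NOT the qdapRegex regexes).
dictionary <- c(
    twitter_url = "[[:space:]]?(https?://)?t[.]co/[[:alnum:]]+",
    url         = "[[:space:]]?(f|ht)tps?://[^[:space:]]+"
)

# Wrap each regex in a group and join them with "|" (alternation).
combined <- paste0("(", dictionary, ")", collapse = "|")

# Remove every match.
removed <- gsub(combined, "", x)
print(removed)
# [1] "download file from"    "twitter type:"         "still another one :-)"

# Extract the matches instead, trimming the optional leading space.
extracted <- lapply(regmatches(x, gregexpr(combined, x)), trimws)
print(unlist(extracted))
# [1] "http://example.com"      "t.co/N1kq0F26tG"         "https://t.co/N1kq0F26tG"
```

Wrapping each pattern in parentheses before joining keeps the `|` from splitting a pattern in an unintended place.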
