regex - Can sed regex simulate lookbehind and lookahead?

Question

I'm trying to write a sed script that captures all "naked" URLs in a text file and replaces them with <a href=[URL]>[URL]</a>. By "naked" I mean a URL that is not already wrapped in an anchor tag.

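My first thought was that I should match URLs that do not have a " or > in front of them and do not have a < or " after them. However, as far as I can tell sed has no lookahead or lookbehind, so I'm having trouble expressing the concept of "not preceded or followed by".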

Sample input:

[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]

Example of desired output:

[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]

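Note that the third line is unchanged, because its URL is already inside <a href>. The first and second lines, on the other hand, are both changed, as is the fourth. Finally, note that all non-URL text is left unchanged.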

Ultimately, I'm trying to do something like this:

sed s/[^>"](http:\/\/[^\s]\+)/<a href="\1">\1<\/a>/g 2-7-2013

I started by verifying that the following correctly matches URLs and removes them:

sed 's/http:\/\/[^\s]\+//g'

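Then I tried this, but it cannot match URLs that start at the very beginning of the file/input, because [^\>"] has to consume a character in front of the URL: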

sed 's/[^\>"]http:\/\/[^\s]\+//g'

Is there a way to work around this in sed, either by simulating lookbehind/lookahead or by explicitly matching the beginning and end of the file?

score 4 · Accepted Answer

sed is an excellent tool for simple substitutions on a single line. For any other text-manipulation problem, just use awk.

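As for a regex that matches URLs, see the definition I'm using in the BEGIN section below. It works for your sample, but I don't know whether it captures every possible URL format. Even if it doesn't, it may be good enough for your needs.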

$ cat file
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
$
$ awk -f tst.awk file
[Beginning of File]<a href="http://foo.bar">http://foo.bar</a> arbitrary text
<a href="http://test.com">http://test.com</a> other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! <a href="http://yahoo.com">http://yahoo.com</a>[End of File]
$
$ cat tst.awk
BEGIN{ urlRe="http:[/][/][[:alnum:]._]+" }
{
    head = ""
    tail = $0
    while ( match(tail,urlRe) ) {
       url  = substr(tail,RSTART,RLENGTH)
       href = "href=\"" url "\""

       if (index(tail,href) == (RSTART - 6) ) {
          # this url is inside href="url" so skip processing it and the next url match.
          count = 2
       }

       if (! (count && count--)) {
          url = "<a " href ">" url "</a>"
       }

       head = head substr(tail,1,RSTART-1) url
       tail = substr(tail,RSTART+RLENGTH)
    }

    print head tail
}
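
If match() is unfamiliar: it sets RSTART to the 1-based position where the regex matched and RLENGTH to the length of the match, which is what the script above uses to split each line into head, url and tail. A minimal illustration follows. The helper name show_match and the sample line are made up for this example, and the regex is simply the urlRe from tst.awk.

```shell
# Illustration only: report what match() finds for the first URL on each line.
# show_match is a hypothetical helper; the regex is urlRe from tst.awk.
show_match() {
    awk '{
        if (match($0, "http:[/][/][[:alnum:]._]+")) {
            # RSTART = 1-based start of the match, RLENGTH = its length.
            print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
        }
    }'
}

echo 'see http://test.com now' | show_match
```

For that sample line this should print 5 15 http://test.com. In tst.awk the same three values are used to check whether href=" sits right in front of the match (RSTART - 6), in which case that match and the next one (the link text) are left alone.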

score 2 · Accepted Answer

The obvious problem with your command is:

You did not escape the parenthesis "("

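This is a quirk of sed regular expressions (basic regular expressions, sed's default): unlike Perl regexes, many symbols are "literal" by default, and you have to escape them to make them "functional". Try: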

s/\([^>"]\?\)\(http:\/\/[^\s]\+\)/\1<a href="\2">\2<\/a>/g
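
To also cover the "not preceded by" part of the question, one possible sketch is to replace the optional prefix with a group that matches either the start of the line or a single character that is neither " nor >, and then put that character back with \1. This is only a stand-in for a negative lookbehind, not a real one. The version below uses sed -E (extended regexes, where ( ) | + need no backslashes) and # as the delimiter so the slashes need no escaping. The function name link_naked_urls is made up for this example, and the URL character class is borrowed from the awk answer rather than [^\s], which, as far as I know, is not treated as "non-whitespace" inside a sed bracket expression.

```shell
# Sketch: wrap http:// URLs that are NOT immediately preceded by " or >.
# (^|[^">]) matches the start of the line or one character other than " and >;
# \1 re-emits that character so no surrounding text is lost.
# link_naked_urls is a hypothetical name; the URL class comes from the awk answer.
link_naked_urls() {
    sed -E 's#(^|[^">])(http://[[:alnum:]._]+)#\1<a href="\2">\2</a>#g'
}

link_naked_urls <<'EOF'
[Beginning of File]http://foo.bar arbitrary text
http://test.com other text
<a href="http://foobar.com">http://foobar.com</a>
Nearing end of file!!! http://yahoo.com[End of File]
EOF
```

On the sample input this should wrap the URLs on lines 1, 2 and 4 and leave line 3 alone. It only looks at the one character in front of the URL, so it does not check what follows the URL and will not cope with every way a link can be written. In default (basic) mode with GNU sed the same idea would be spelled \(^\|[^">]\) with \+ for the repetition.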
