regex - Awk match() - multiple per line

Question

I'm using gawk's match() function to extract links from an HTML file. The regular expression looks like this:

match($0, /(<a href=\")([^\"]+)/, arr)

It seems I can't add a "/g" option at the end to get multiple matches per line?

score 5 · Accepted Answer

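That's right. AWK regular expressions don't have flags, and match has no built-in support for finding the second and later matches on a line; only the gsub and gensub functions have that. I would try something like this: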

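# gensub() returns the new string; it does not modify $0
s = gensub(/<a href=\"([^\"]+)/, "%\\1%", "g")
n = split(s, arr, "%")
# the links are now in arr[2], arr[4], ... (even indices up to n)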

where % is a string that you can guarantee will not appear in the input.
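
Another way to pick up every match on a line is to call match() in a loop, each time on the part of the line that has not been scanned yet. The following is only a minimal sketch, assuming links always have the exact form <a href="..."> with double quotes; the function name extract_links is made up for illustration, and the script uses only POSIX awk features (RSTART, RLENGTH, substr), so no separator string is needed.

```shell
# extract_links: print every href value found on each input line.
# (hypothetical helper name; reads files given as arguments, or stdin)
extract_links() {
  awk '{
    rest = $0
    # match() reports only the leftmost match via RSTART and RLENGTH,
    # so keep re-running it on the unscanned remainder of the line.
    while (match(rest, /<a href="[^"]+/)) {
      # skip the 9 characters of <a href=" to keep just the URL
      print substr(rest, RSTART + 9, RLENGTH - 9)
      rest = substr(rest, RSTART + RLENGTH)
    }
  }' "$@"
}
```

It could be used as `extract_links page.html`, where page.html is a placeholder file name, or at the end of a pipeline.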

score 1 · Accepted Answer

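The text-mode browser lynx might be a good tool for harvesting URLs. The -dump flag writes the formatted output to standard output. At the end, it gives a numbered list of all the visible and hidden links in that page (or file; lynx accepts either a URL or a filename as its argument).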

$ lynx -dump http://www.stackoverflow.com 

[snip]
References

   Visible links
   1. http://stackoverflow.com/opensearch.xml
   2. http://stackoverflow.com/feeds
   3. http://stackexchange.com/
   4. http://stackoverflow.com/users/login
   5. http://careers.stackoverflow.com/
   6. http://chat.stackoverflow.com/
[snip]
 676. http://creativecommons.org/licenses/by-sa/3.0/
 677. http://blog.stackoverflow.com/2009/06/attribution-required/

   Hidden links:
 678. http://www.peer1.com/stackoverflow
 679. http://creativecommons.org/licenses/by-sa/3.0/
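
If only the URLs are wanted, the numbered list can be filtered with awk. This is a rough sketch, assuming the dump contains a line reading exactly "References" before the list and that each entry looks like "   1. http://..." as in the output above; the function name dump_links is made up for illustration.

```shell
# dump_links: print only the URLs from the numbered link list
# that follows the "References" line in `lynx -dump` output.
# (hypothetical helper name; reads files given as arguments, or stdin)
dump_links() {
  awk '
    /^References$/ { refs = 1; next }        # the list starts after this line
    refs && $1 ~ /^[0-9]+\.$/ { print $2 }   # "  12. URL" -> URL
  ' "$@"
}
```

For example: `lynx -dump http://www.stackoverflow.com | dump_links`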
