unix - Grab all instances of a pattern from one line of text, edit them, and pipe them out to a line-delimited text file

Question

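I have a block of text (one line) that is a list of URLs separated by tags and other junk. I want to parse that entire block for URLs that match 'http.*">RSS', edit every instance of that pattern (to remove everything after the glob), and pipe it all to a file as line-separated entries.

I thought I could do this with grep (and then sed to edit and add the new lines), but grep grabs the matching line, not the matching pattern. Is there another command I should be using? I also tried using sed to add a line break (\n) before the pattern, but that isn't working either.

Edit: here's an example of the data I'm working with: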

OUT</a> (<a href="https://evilcakes.wordpress.com/rss">RSS</a>)</li><li><a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)</li><li><a href="http://www.foodtechconnect.com" title="">Food+Tech Connect</a> (<a href="http://feeds.feedburner.com/foodtechconnect">RSS</a>)</li><li><a href="http://www.innatthecrossroads.com" title="">Inn at the Crossroads</a> (<a href="http://innatthecrossroads.com/feed/">RSS</a>)</li><li><a href="http://www.seriouseats.com/" title="">Serious Eats</a> (<a href="http://feeds.seriouseats.com/seriouseatsfeaturesvideos">RSS</a>)</li><li><a href="http://www.thatsnerdalicious.com" title="">That's Nerdalicious!</a> (<a href="http://www.thatsnerdalicious.com/feed/">RSS</a>)</li><li><a href="http://thedrunkenmoogle.com/" title="">The Drunken Moogle</a> (<a href="http://www.thedrunkenmoogle.com/rss">RSS</a>)</li></ul></li><li><h2 class="entry-title">Comics</h2><ul class="opmlGroup"><li><a

score 3 · Accepted Answer

Here's one way, using GNU grep:

grep -oP 'http[^"]*(?=">RSS)' file

Results:

https://evilcakes.wordpress.com/rss
http://feeds.feedburner.com/EaterNational
http://feeds.feedburner.com/foodtechconnect
http://innatthecrossroads.com/feed/
http://feeds.seriouseats.com/seriouseatsfeaturesvideos
http://www.thatsnerdalicious.com/feed/
http://www.thedrunkenmoogle.com/rss

Options:

-o, --only-matching
    Print only the matched (non-empty) parts of a matching line, with each such 
    part on a separate output line.
-P, --perl-regexp
    Interpret PATTERN as a Perl regular expression. This is highly experimental
    and grep -P may warn of unimplemented features.

You may also want to read up on lookaround assertions. HTH.
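To make the role of the lookahead concrete, here is a minimal sketch that runs the same pattern with and without `(?=">RSS)` and then writes the result to a file, as the question asks. The `sample` string is a shortened copy of the question's data and `feeds.txt` is a hypothetical output name; it assumes a GNU grep built with `-P` support.

```shell
# Shortened, one-line sample based on the data in the question.
sample='<a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)</li><li><a href="http://www.seriouseats.com/" title="">Serious Eats</a> (<a href="http://feeds.seriouseats.com/seriouseatsfeaturesvideos">RSS</a>)'

# Without the lookahead, the trailing ">RSS is part of every match.
with_suffix=$(printf '%s\n' "$sample" | grep -oP 'http[^"]*">RSS')

# With (?=">RSS), the suffix must follow the URL but is not included in the match.
urls=$(printf '%s\n' "$sample" | grep -oP 'http[^"]*(?=">RSS)')

# Write one URL per line to a file (feeds.txt is a hypothetical name).
printf '%s\n' "$urls" > feeds.txt
```

The site links (`http://eater.com/` and so on) are skipped in both cases because they are followed by `" title=` rather than `">RSS`.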

Edit:

Here's another way, using awk:

awk -F\" '{ for(i=1;i<NF;i++) if ($(i+1) ~ /RSS/) print $i }' file

Results:

https://evilcakes.wordpress.com/rss
http://feeds.feedburner.com/EaterNational
http://feeds.feedburner.com/foodtechconnect
http://innatthecrossroads.com/feed/
http://feeds.seriouseats.com/seriouseatsfeaturesvideos
http://www.thatsnerdalicious.com/feed/
http://www.thedrunkenmoogle.com/rss
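The awk version works by splitting the line on double quotes, so every attribute value becomes its own field, and printing a field when the next one contains RSS. A slightly stricter sketch, assuming the feed link text always starts right after the closing quote as `>RSS`, anchors that test so a site title that merely contains "RSS" would not trigger a match. The `sample` string is a shortened copy of the question's data.

```shell
# Shortened, one-line sample based on the data in the question.
sample='<li><a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)</li>'

# Split on double quotes; print a field only when the following field
# begins with >RSS, i.e. the field is the href value of an RSS link.
feeds=$(printf '%s\n' "$sample" |
  awk -F'"' '{ for (i = 1; i < NF; i++) if ($(i+1) ~ /^>RSS/) print $i }')

printf '%s\n' "$feeds"
```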

score 3 · Accepted Answer

This might work for you (GNU sed):

sed '/https\?:[^"]*/!d;s//\n&\n/;s/^[^\n]*\n//;P;D' file
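As written, that one-liner appears to print every http/https URL on the line, including the site links that are not followed by `">RSS`. If only the feed URLs are wanted, one possible variation (a sketch, not the answerer's own command, and still assuming GNU sed) is to include the suffix in the address regex and strip it before printing. The `sample` string is a shortened copy of the question's data.

```shell
# Shortened, one-line sample based on the data in the question.
sample='<a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)</li><li><a href="http://www.seriouseats.com/" title="">Serious Eats</a> (<a href="http://feeds.seriouseats.com/seriouseatsfeaturesvideos">RSS</a>)'

rss_only=$(printf '%s\n' "$sample" | sed '
  # Delete the line once no URL followed by ">RSS remains.
  /https\?:[^"]*">RSS/!d
  # Reuse that regex (empty pattern) and wrap the first match in newlines.
  s//\n&\n/
  # Drop everything before the match.
  s/^[^\n]*\n//
  # Strip the trailing ">RSS so only the URL is left on the first line.
  s/">RSS\n/\n/
  # Print the first line, delete it, and rerun the script on the remainder.
  P;D
')

printf '%s\n' "$rss_only"
```

The `P;D` pair is what turns a single input line into many output lines: `P` prints up to the first newline, and `D` removes that part and restarts the script on what is left without reading new input.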

score 1 · Accepted Answer

I put your sample data into urls.dat.

cat urls.dat | awk '{n=split($0,a,"\""); for (i=1;i<=n;i++) if ( match( a[i], "^http" ) ) print a[i]; }'

score 1 · Accepted Answer

Here's one way that works with both GNU and BSD grep:

<infile grep -Eo 'https?://[^"]+">RSS' | grep -Eo '^[^"]+'

Output:

https://evilcakes.wordpress.com/rss
http://feeds.feedburner.com/EaterNational
http://feeds.feedburner.com/foodtechconnect
http://innatthecrossroads.com/feed/
http://feeds.seriouseats.com/seriouseatsfeaturesvideos
http://www.thatsnerdalicious.com/feed/
http://www.thedrunkenmoogle.com/rss
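Since plain `-E` has no lookahead, this approach does the job in two passes. A minimal sketch that keeps the intermediate result visible, using a shortened copy of the question's data as `sample`:

```shell
# Shortened, one-line sample based on the data in the question.
sample='<a href="http://eater.com/" title="">Eater National</a> (<a href="http://feeds.feedburner.com/EaterNational">RSS</a>)</li><li><a href="http://www.seriouseats.com/" title="">Serious Eats</a> (<a href="http://feeds.seriouseats.com/seriouseatsfeaturesvideos">RSS</a>)'

# Pass 1: keep each URL together with the ">RSS that must follow it.
stage1=$(printf '%s\n' "$sample" | grep -Eo 'https?://[^"]+">RSS')

# Pass 2: on each of those lines, keep only the text before the first quote.
feeds=$(printf '%s\n' "$stage1" | grep -Eo '^[^"]+')

printf '%s\n' "$feeds"
```

The first pass is what filters out the non-feed links; the second pass only trims the suffix.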
