xml - sed one-liner - Find the pair of delimiters surrounding a keyword

Question

I typically process large XML files, and I use grep to do word counts to confirm certain statistics.

For example, I want to make sure a single XML file has at least five instances of widget, like this:

cat test.xml | grep -ic widget

Additionally, I'd like to be able to log the lines that widget appears on, like this:

cat test.xml | grep -i widget > ~/log.txt
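One caveat with the count above: `grep -c` reports matching lines, not individual matches, so a line containing widget twice counts once. A minimal sketch of the difference, assuming a grep that supports `-o` (GNU and BSD grep do; it is not in POSIX). The sample file here is hypothetical and exists only for the illustration:

```shell
# Hypothetical sample file, created only for this illustration
sample=$(mktemp)
printf '%s\n' 'widget Widget' 'no match here' 'WIDGET' > "$sample"

# Number of matching lines (what grep -c reports); no cat is needed
line_count=$(grep -ic widget "$sample")

# Number of individual matches: -o prints each match on its own line
match_count=$(grep -io widget "$sample" | wc -l | tr -d ' ')

echo "lines: $line_count, occurrences: $match_count"
rm -f "$sample"
```

If every widget sits on its own line, as in the sample below, the two counts agree.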

However, the key information I really need is the block of XML code that widget appears in. A sample file looks like this:

<test> blah blah
  blah blah blah
  widget
  blah blah blah
</test>

<formula>
  blah
  <details> 
    widget
  </details>
</formula>

I am trying to get the following output from the sample text above:

<test>widget</test>

<formula>widget</formula>

Effectively, I'm trying to get a single line containing the highest-level markup tag that applies to the block of XML text/code surrounding the arbitrary string widget.

Does anyone have suggestions for implementing this as a command-line one-liner?

Thank you.

score 3 · Accepted Answer

An inelegant way that uses both sed and awk:

sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}' file.txt | awk 'NR%2==1 { sub(/^[ \t]+/, ""); search = $0 } NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }'

Results:

<test>widget</test>
<formula>widget</formula>

Explanation:

## The sed pipe:

sed -ne '/[Ww][Ii][Dd][Gg][Ee][Tt]/,/^<\// {//p}'
## This finds the widget pattern, ignoring case, then finds the last, 
## highest level markup tag (these must match the start of the line)
## Ultimately, this prints two lines for each pattern match

## Now the awk pipe:

NR%2==1 { sub(/^[ \t]+/, ""); search = $0 }
## This takes the first line (the widget pattern) and removes leading
## whitespace, saving the pattern in 'search'

NR%2==0 { end = $0; sub(/^<\//, "<"); printf "%s%s%s\n", $0, search, end }
## This finds the next line (which is even), and stores the markup tag in 'end'
## We then remove the slash from this tag and print it, the widget pattern, and
## the saved markup tag

HTH

score 2 · Accepted Answer

This might work for you (GNU sed):

sed '/^<[^/]/!d;:a;/^<\([^>]*>\).*<\/\1/!{$!N;ba};/^<\([^>]*>\).*\(widget\).*<\/\1/s//<\1\2<\/\1/p;d' file

score 2 · Accepted Answer

 sed -nr '/^(<[^>]*>).*/{s//\1/;h};/widget/{g;p}' test.xml

Prints:

<test>
<formula>

Printing the exact format you need makes a sed-only one-liner more complicated.

EDIT: In GNU sed you can use /widget/I instead of /widget/ for a case-insensitive match; otherwise use [Ww] for each letter, as in the other answer.
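A quick sketch of the case-insensitive variant, assuming GNU sed (the `I` address modifier and `-r` are GNU extensions). The input is a hypothetical variation of the question's sample with the keyword in mixed case:

```shell
# Hypothetical variation of the sample file with mixed-case keywords
file=$(mktemp)
cat > "$file" <<'EOF'
<test> blah blah
  Widget
</test>

<formula>
  <details>
    WIDGET
  </details>
</formula>
EOF

# Lines starting with a tag at column one are cut down to that tag and
# saved in the hold space; on a case-insensitive widget match the saved
# tag is fetched back and printed
result=$(sed -nr '/^(<[^>]*>).*/{s//\1/;h};/widget/I{g;p}' "$file")
printf '%s\n' "$result"
rm -f "$file"
```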

score 1 · Accepted Answer

This needs gawk, because RS is a regular expression:

BEGIN {
    # make a stream of words
    RS="(\n| )"
}

# match </tag>
/<\// {
    s--
    next
}

# match <tag>
/</ {
    if (!s) {
        tag=substr($0, 2)
    }
    s++
}

$0=="widget" {
    print "<" tag $0 "</" tag
}
