html - Simplest way to remove html/xml tags from single-line output

Question

I have output from grep that I'm trying to clean up, which looks like this:

<words>Http://www.path.com/words</words>

I tried using...

sed 's/<.*>//'

...to remove the tags, but it just destroys the whole line. I don't understand why, since every "<" is closed by a ">" before it reaches the content.

What is the simplest way to do this?

Thanks!

score 8 · Accepted Answer

Try this sed expression:

sed 's/<.*>\(.*\)<\/.*>/\1/'

A quick breakdown of the expression:

<.*>   - Match the first tag
\(.*\) - Match and save the text between the tags   
<\/.*> - Match the end tag making sure to escape the / character  
\1     - Output the result of the first saved match 
       -   (the text that is matched between \( and \))
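
As for why the original attempt wiped the line: `.*` is greedy, so `<.*>` matches from the first `<` all the way to the last `>`, which here is the entire line. The sketch below compares that attempt with the expression above and with a common alternative that is not part of this answer, `<[^>]*>`, which cannot run past a `>`. The variable names are only for illustration.

```shell
line='<words>Http://www.path.com/words</words>'

# Greedy: <.*> spans from the first '<' to the last '>', so nothing is left.
greedy=$(printf '%s\n' "$line" | sed 's/<.*>//')

# The answer's expression: save the text between the tags and keep only that.
captured=$(printf '%s\n' "$line" | sed 's/<.*>\(.*\)<\/.*>/\1/')

# Alternative (an assumption, not from the answer): [^>]* stops at the first '>',
# so each tag is matched and removed on its own.
stripped=$(printf '%s\n' "$line" | sed 's/<[^>]*>//g')

printf 'greedy=[%s]\ncaptured=[%s]\nstripped=[%s]\n' "$greedy" "$captured" "$stripped"
```

With the `g` flag, the `[^>]*` form removes every tag on the line independently, so it does not rely on the line containing exactly one tag pair.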

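More on back-references

A question came up in the comments that should probably be addressed for completeness.

The \( and \) are sed's back-reference markers. They save a portion of the matched expression for later use.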
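
As a small warm-up, here is a minimal sketch of saved groups being reused, first in the replacement and then inside the pattern itself. The input string `hello world` and the variable names are made up for illustration.

```shell
# \(...\) saves each word; \2 \1 in the replacement writes them back swapped.
swapped=$(printf '%s\n' 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')

# A back-reference also works inside the pattern:
# \([a-z]\)\1 matches a lowercase letter followed by the same letter again.
doubled=$(printf '%s\n' 'hello world' | sed 's/\([a-z]\)\1/[\1\1]/')

printf '%s\n%s\n' "$swapped" "$doubled"
```

The first command should print `world hello`, and the second `he[ll]o world`, since `ll` is the first doubled letter on the line.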

For instance, suppose we have the input string:

This has (parens) in it. Also, we can use parens like this parens using back-references.

We develop the expression:

sed 's/.*(\(.*\)).*\1\(.*\)\1.*/\1 \2/'

This gives us:

parens like this

How on earth did that work? Let's break the expression down and find out.

Expression breakdown:

sed s/ - Start of the sed substitute (s) command.
.*     - Match any run of characters (possibly empty) at the start.
(      - Match a literal left parenthesis character.
\(.*\) - Match any characters and save them as the first back-reference (\1). In this case it matches the text between the parentheses in the input.
)      - Match a literal right parenthesis character.
.*     - Same as the first .* above.
\1     - Match the first saved back-reference. In the case of our sample this is filled in with `parens`.
\(.*\) - Match any characters and save them as the second back-reference (\2).
\1     - Match the first saved back-reference again.
.*     - Match the rest of the line.
/      - End of the match expression. Signals transition to the output expression.
\1 \2  - Output our two back-references, separated by a space.
/      - End of output expression.

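As you can see, the back-reference taken from between the parentheses ( ( and ) ) was substituted back into the matching expression so that it could match the string `parens`.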
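
To see exactly what each group captured, the replacement can wrap the back-references in brackets. This is a sketch only: the English sample sentence below is assumed for illustration, and the bracketed replacement is a variation on the expression above, not part of the original answer.

```shell
# Sample sentence assumed for illustration.
input='This has (parens) in it. Also, we can use parens like this parens using back-references.'

# Same pattern as above; only the replacement differs, to make the captures visible.
result=$(printf '%s\n' "$input" | sed 's/.*(\(.*\)).*\1\(.*\)\1.*/[\1][\2]/')

printf '%s\n' "$result"
```

This should print `[parens][ like this ]`. The second group keeps the spaces on either side of `like this`, because `.*` captures everything between the two later occurrences of `parens`.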
