linux - Use a shell script to find a word and export the 35 characters after it?

Question

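I have a file input.txt that contains a lot of strange characters, HTML tags, and useful material. In a new file output.txt, I want to show the 35 characters that follow the word description, leaving out HTML tags and strange characters like $$#$#@$#@***$#. Please help. Thanks in advance.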

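My ultimate goal is to find the word description and print the 35 characters that follow it, which must not include HTML tags or strange characters. Is that possible? For example, given this: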

<description>&lt;p&gt;&lt;img class="float_right"
 src="http://static3.businessinsider.com/image/502ab0036bb3f7147b00000f-400-300/dnu.jpg"
 border="0" alt="dnu" width="400" height="300" /&gt;&lt;/p&gt;&lt;p&gt;The lawn
 was filled with &lt;a class="hidden_link"
 href="http://www.businessinsider.com/blackboard/goldman-sachs"&gt;Goldman
 Sachs&lt;/a&gt; Group Inc. partners dressed in pink looking out on a pink sunset.

I want to start at: The lawn was filled with (skip the tags here as well and continue) Group Inc. partners (35 characters, done!), then stop and look for the next description.

score 1 · Accepted Answer

You can use XPath to select all the text inside an HTML node. In your case, this should work:

xpath -q -e '//description//text()' input.txt

The query //description//text() works as follows:

//description: drills down through the HTML document until it finds a node named description
//text(): drills down through all other nodes inside that node and selects their text

Given your data, this outputs:

The lawn was filled with 
Goldman Sachs
 Group Inc. partners dressed in pink looking out on a pink sunset.
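
The command above extracts the text but does not trim it to 35 characters. The following is a minimal sketch of that last step, assuming POSIX paste, sed and cut are available. The function name first_35 is hypothetical. It joins the lines, also strips any markup that is still present in entity-escaped form (such as &lt;p&gt;), squeezes repeated spaces, and keeps the first 35 characters. It treats its whole input as one description, so it would need to be called once per description, and some cut implementations count bytes rather than characters, which matters only for non-ASCII text.

```shell
#!/bin/sh
# first_35 (hypothetical helper): read text on stdin, print its first 35 characters
# after removing tags and collapsing whitespace.
first_35() {
  # Join all input lines into a single line separated by spaces.
  paste -s -d ' ' - |
    # Decode escaped angle brackets, drop anything that looks like a tag,
    # decode &quot; and &amp;, squeeze runs of spaces, remove a leading space.
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' \
        -e 's/<[^>]*>//g' \
        -e 's/&quot;/"/g' -e 's/&amp;/\&/g' \
        -e 's/  */ /g' -e 's/^ //' |
    # Keep only the first 35 characters of the cleaned line.
    cut -c1-35
}

# Sample input: part of the description from the question.
cat <<'EOF' | first_35
&lt;p&gt;The lawn
 was filled with &lt;a class="hidden_link"
 href="http://www.businessinsider.com/blackboard/goldman-sachs"&gt;Goldman
 Sachs&lt;/a&gt; Group Inc. partners dressed in pink looking out on a pink sunset.
EOF
```

To combine it with the query, the output of the xpath command could be piped into the function, for example xpath -q -e '//description//text()' input.txt | first_35 > output.txt, with the caveat that this handles all matched text as a single description.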
