xml - Web-Harvest -- removing unusual characters

Question

I'm trying to scrape a page that has a few spaces after an anchor:

</a>&nbsp;&nbsp;|&nbsp;&nbsp;

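I can't seem to find a way to specify the text: either I get a processor error, or the pattern fails to detect the string at all. With these characters present the XML is not well-formed, so everything after them fails the html-to-xml conversion. That means I need to remove all of them (note that elsewhere in the document they are followed by div tags or other content).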

My code:

<xpath expression="/">
    <regexp replace="true">
        <regexp-pattern>(nbsp;)</regexp-pattern>
        <regexp-source>
            <html-to-xml omitcomments="true" advancedxmlescape="true" prunetags="head,script,meta,meta ,p,base,br,link,img,image,input,option,nbsp;">
                <http url="http://mysite.org/map/aindex/" method="get" />
            </html-to-xml>
        </regexp-source>
        <regexp-result>
            <template></template>
        </regexp-result>
    </regexp>
</xpath>

I think my problem is in the regexp pattern. I have already tried:



    &nbsp;
    \& nbsp;  (without the space in between; SO doesn't display that correctly)
    \s+\|\s+

among others. I also tried putting the expression in a CDATA element, but that doesn't work either.

Any thoughts?
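
One thing that may be worth checking, offered as a sketch rather than a confirmed fix: the Web-Harvest configuration is itself an XML file, so a literal `&nbsp;` inside `<regexp-pattern>` is an undeclared entity in that file, and writing the ampersand as `&amp;` avoids that. The sketch below also assumes the replacement should run on the raw HTTP response, before `html-to-xml`, so the pattern still sees the literal text `&nbsp;`. The `<config>` wrapper and the variable name `page` are hypothetical, and whether this ordering resolves the conversion failure on the real page is untested.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- "page" is a hypothetical variable name used only for this sketch -->
    <var-def name="page">
        <xpath expression="/">
            <html-to-xml omitcomments="true">
                <!-- Strip the entity from the raw HTML before it is converted to XML -->
                <regexp replace="true">
                    <!-- &amp;nbsp; is how the literal text "&nbsp;" is written inside an XML file -->
                    <regexp-pattern>&amp;nbsp;</regexp-pattern>
                    <regexp-source>
                        <http url="http://mysite.org/map/aindex/" method="get"/>
                    </regexp-source>
                    <regexp-result>
                        <!-- Empty template: each match is replaced with nothing -->
                        <template></template>
                    </regexp-result>
                </regexp>
            </html-to-xml>
        </xpath>
    </var-def>
</config>
```

If the spaces reach the pattern as the actual no-break space character rather than as entity text, the pattern would presumably need to match that character instead (for example `\u00A0` in Java regular expression syntax).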
