java - Extracting text from HTML in J2ME

Question

I have a string taken from an HTML web page, like this:

String htmlString =

<span style="mso-bidi-font-family:Gautami;mso-bidi-theme-font:minor-bidi">President Pranab pay great 
tributes to Motilal Nehru on occasion of 
</span>
150th birth anniversary. Pranab said institutions evolved by 
leaders like him should be strengthened instead of being destroyed. 
<span style="mso-spacerun:yes">&nbsp;
</span>
He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of 
Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly,   
the first set of coins and postal stamps released at the function to commemorate the event.
</p>

I need to extract the text from the string above. After extraction, the output should look like this:

Output:

President Pranab pay great tributes to Motilal Nehru on occasion of 150th birth anniversary. Pranab said institutions evolved by leaders like him should be strengthened instead of being destroyed.  He listed his achievements like his role in evolving of Public Accounts Committee and protecting independence of Legislature from the influence of the Executive by establishing a separate cadre for the Central Legislative Assembly, now Parliament. Calling himself a student of history, he said Motilal's Swaraj Party acted as a disciplined assault force in the Legislative Assembly and he was credited with evolving the system of a Public Accounts Committee which is now one of the most effective watchdogs over executive in matters of money and finance. Mukherjee also received the first set of coins and postal stamps released at the function to commemorate the event.

To do this, I used the following logic:

int spanIndex = content.indexOf("<span");
spanIndex = content.indexOf(">", spanIndex);
int endspanndex = content.indexOf("</span>", spanIndex);
content = content.substring(spanIndex  + 1, endspanndex);

The output I get is:

President Pranab pay great tributes to Motilal Nehru on occasion of

I have tried various HTML parsers, but none of them work on J2ME.

Can anyone help me get the full description text? Thanks.

score 2 · Accepted Answer

If you are on BlackBerry OS 5.0 or later, you can use BrowserField to parse the HTML into a DOM document.

score 1 · Accepted Answer

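You can continue the same way you propose with the rest of the string. Alternatively, a simple finite-state automaton would solve this. I have seen such a solution in the moJab project (you can download the sources here). Its mojab.xml package contains a minimal XML parser designed for J2ME, and it should parse your example as well. Take a look at the sources; it is just three simple classes, and it seems to be usable as-is.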
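The finite-state idea mentioned above can be sketched as follows. This is not the moJab parser; `TagStripper` and `stripTags` are hypothetical names used only for illustration. It uses just `String` and `StringBuffer`, so it should fit CLDC, and it assumes that `>` never appears inside an attribute value and that the input has no comments or script blocks.

```java
// Hypothetical helper (not taken from moJab): a two-state automaton
// that copies characters outside tags and skips characters inside them.
final class TagStripper {
    private static final int TEXT = 0; // outside any tag
    private static final int TAG = 1;  // between '<' and '>'

    private TagStripper() {}

    static String stripTags(String html) {
        StringBuffer out = new StringBuffer(html.length());
        int state = TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (state == TEXT) {
                if (c == '<') {
                    state = TAG;      // a tag starts: stop copying
                } else {
                    out.append(c);    // plain text: keep it
                }
            } else if (c == '>') {
                state = TEXT;         // the tag ends: resume copying
            }
        }
        return out.toString();
    }
}
```

Calling `TagStripper.stripTags(htmlString)` makes a single pass over the input and, unlike the single `indexOf` lookup in the question, keeps the text after the first `</span>`. An unterminated `<` simply discards everything that follows it.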

score 1 · Accepted Answer

Since J2ME does not support HTMLParser, you can extract the text like this:

private String removeHtmlTags(String content) {

        int beginTag = content.indexOf("<");
        while (beginTag != -1) {

            // Look for the closing '>' only after the opening '<', so a
            // stray '>' earlier in the text is not mistaken for a tag end.
            int endTag = content.indexOf(">", beginTag);
            if (endTag == -1) {
                // Unterminated tag: stop instead of looping forever.
                break;
            }
            // Cut the tag out and keep the text on both sides of it.
            content = content.substring(0, beginTag)
                    + content.substring(endTag + 1);
            beginTag = content.indexOf("<", beginTag);
        }
        return content;
    }
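Removing the tags still leaves entities such as `&nbsp;` and the original line breaks in the result, while the output wanted in the question is one continuous paragraph. A minimal sketch of a follow-up cleaning step is shown below. `TextCleaner` and its methods are hypothetical names, only the `&nbsp;` entity is handled, and every run of whitespace collapses to a single space, so the spacing may differ slightly from the sample output.

```java
// Hypothetical helper: post-processes text whose tags were already removed.
final class TextCleaner {
    private TextCleaner() {}

    // Replaces every occurrence of target using only indexOf/substring,
    // which avoids relying on String.replace(CharSequence, CharSequence).
    static String replaceAll(String text, String target, String replacement) {
        if (target.length() == 0) {
            return text; // nothing sensible to replace
        }
        StringBuffer out = new StringBuffer(text.length());
        int from = 0;
        int hit = text.indexOf(target, from);
        while (hit != -1) {
            out.append(text.substring(from, hit)).append(replacement);
            from = hit + target.length();
            hit = text.indexOf(target, from);
        }
        out.append(text.substring(from));
        return out.toString();
    }

    // Collapses runs of spaces, tabs and line breaks into one space and
    // drops leading and trailing whitespace.
    static String collapseWhitespace(String text) {
        StringBuffer out = new StringBuffer(text.length());
        boolean pendingSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == ' ' || c == '\n' || c == '\r' || c == '\t') {
                pendingSpace = out.length() > 0;
            } else {
                if (pendingSpace) {
                    out.append(' ');
                }
                pendingSpace = false;
                out.append(c);
            }
        }
        return out.toString();
    }

    static String clean(String tagFreeText) {
        return collapseWhitespace(replaceAll(tagFreeText, "&nbsp;", " "));
    }
}
```

A typical call would be `TextCleaner.clean(removeHtmlTags(htmlString))`.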

score 0 · Accepted Answer

JSoup is a very popular library for extracting text from HTML documents. Here is one example of doing exactly that.
