java - Mozilla parser for screen scraping

Question

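I am writing an app that takes in a page's HTML, extracts certain elements of the page (tables, for example), and returns the HTML code for those elements. I am trying to do this in Java, using the Mozilla parser to simplify navigating the page, but I am having trouble extracting the HTML code I need.

Maybe my whole approach, i.e. the Mozilla parser, is wrong, so if there is a better solution I am open to suggestions.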

String html = // whatever the page's HTML is

MozillaParser p = // instantiate parser


// pass in html to parse which creates a dom object
Document d = p.parse(html);

// get a list of all the form elements in the page
NodeList l =  d.getElementsByTagName("form");

// iterate through all forms
for(int i = 0; i < l.getLength(); i++){

    // get a form
    Node n = l.item(i);

    // Print out the HTML code for just this form.
    // This is the portion I haven't figured out.
    // I just made up the innerHTML method, but that's
    // the end result I want: a way to see the HTML
    // code for one particular node.
    System.out.println( n.innerHTML() );
}

score 1 · Accepted Answer

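I have had some success with htmlcleaner ( http://htmlcleaner.sourceforge.net/ ). It is very fast and has options that let you decide how "strict" it should be. That said, I try to avoid HTML scraping wherever possible, for all the obvious reasons (data exposed through REST or another form of API tends to be more reliable, legal, and easier to parse).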

score 1 · Accepted Answer

The Mozilla parser seems like overkill here. I have had some success using Jericho for exactly the kind of thing you are doing.
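
Separately from the choice of parser: if what you end up with is a standard org.w3c.dom.Node, as in the question's code, the JDK's own javax.xml.transform API can print the markup for just that node. Below is a minimal sketch of that idea, not a drop-in replacement for the made-up innerHTML(). The class name NodeMarkup and the helpers parseXhtml, outerHtml and markupForTag are names invented for this illustration, and the JDK's DocumentBuilder is only a stand-in for whatever HTML parser you actually use (it accepts well-formed XHTML only, so it will not cope with real-world tag soup).

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for this sketch; not part of any parser library.
final class NodeMarkup {

    // Stand-in for a real HTML parser: builds a DOM from well-formed XHTML only.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    // Serializes one node and its whole subtree back to markup.
    static String outerHtml(Node node) {
        try {
            // A transformer created without a stylesheet performs an identity copy.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            // Without this, the output would start with an <?xml ...?> declaration.
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Collects the markup of every element with the given tag name.
    static List<String> markupForTag(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(outerHtml(nodes.item(i)));
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>intro</p>"
                + "<form action=\"/go\"><input name=\"q\"/></form>"
                + "</body></html>";
        for (String form : markupForTag(parseXhtml(page), "form")) {
            System.out.println(form);
        }
    }
}
```

Note that this serializes the node as XML, so empty elements come out self-closed and the result is the node's outer markup (the element itself plus its children), not only its inner content. If you need only the children, you could serialize each child node in turn and concatenate the results.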

score 0 · Accepted Answer

I coded an HTML wrapper in Javascript on the Mozilla platform. The code is packaged as two Firefox browser extensions. One, called MetaStudio, is a data schema definition tool for semantically annotating Web pages. The other, called DataScraper, is a tool that extracts data snippets from Web pages and formats them into XML files.

All of the source code is readable. Visit http://www.gooseeker.com to download it.
