java - Extracting Information from websites

Question

Not every website exposes its data well through XML feeds, APIs, etc.

How could I go about extracting information from a website? For example:

...
<div>
  <div>
    <span id="important-data">information here</span>
  </div>
</div>
...

I come from a background of Java programming and coding with Apache XMLBeans. Is there anything similar for parsing HTML, when I know the structure and the data sits inside a known tag?

Thanks

score 4 · Accepted Answer

Here is an article that covers several screen-scraping tools written in Java.

In general, you will probably want to look into regular expressions to do the pattern matching you are after.
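As a rough illustration of the regular-expression route, here is a minimal sketch using only `java.util.regex` against the snippet from the question. The class name `SpanRegexExtractor` and the method `extract` are made up for this example. It assumes the `id` attribute is double-quoted and that the span contains plain text with no nested tags. Regexes tend to break when the markup changes, so treat this as a quick fix for a page whose structure you know.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
final class SpanRegexExtractor {

    // Matches <span ... id="important-data" ...>TEXT</span> and captures TEXT.
    // [^<]* stops at the first '<', so nested tags inside the span are not handled.
    private static final Pattern IMPORTANT_DATA = Pattern.compile(
            "<span\\b[^>]*\\sid\\s*=\\s*\"important-data\"[^>]*>([^<]*)</span>",
            Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of the first matching span, or empty if none matches.
    static Optional<String> extract(String html) {
        Matcher m = IMPORTANT_DATA.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<div>\n  <div>\n"
                + "    <span id=\"important-data\">information here</span>\n"
                + "  </div>\n</div>";
        System.out.println(extract(html).orElse("(not found)"));
    }
}
```

Fetching the page itself (for example with `java.net.http.HttpClient`) is left out; the sketch starts from HTML that is already in a `String`.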

Hope that helps!

score 3 · Accepted Answer

There are several open-source HTML parsers for Java.

I have used JTidy in the past and it worked well. It gives you a DOM of the HTML page, from which you should be able to grab the tags you need.
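To show what "grab the tags you need from the DOM" might look like, here is a sketch that walks a standard `org.w3c.dom.Document`. It deliberately does not call JTidy. Instead it uses the JDK's own `DocumentBuilder`, which only accepts well-formed markup, so the assumption is that the page is already well-formed or has been cleaned up by a tolerant parser first. The class `SpanDomExtractor` and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; assumes well-formed (XHTML-like) input.
final class SpanDomExtractor {

    // Parses well-formed markup into a DOM tree using the JDK's XML parser.
    static Document parse(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Markup is not well-formed", e);
        }
    }

    // Returns the text of the first <tagName> element whose id attribute equals id, or null.
    static String findTextById(Document doc, String tagName, String id) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            if (id.equals(element.getAttribute("id"))) {
                return element.getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String markup = "<div><div><span id=\"important-data\">information here</span></div></div>";
        System.out.println(findTextById(parse(markup), "span", "important-data"));
    }
}
```

The loop over `getElementsByTagName` is used rather than `Document.getElementById`, because the latter only works when the parser knows which attributes are of type ID.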

score 0 · Accepted Answer

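Java seems like a pretty tough constraint for a task like this. Is it a hard requirement? Scripting languages are ideal for building large amounts of last-mile code.

If you are open to it, Ruby + Hpricot makes this completely trivial. You can search (and manipulate) the HTML content using CSS or XPath selectors (or both). Fetching the document, parsing it, and extracting the text in your example is literally one line of code.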
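If Java does turn out to be a hard requirement, the XPath selector idea carries over. The sketch below uses the JDK's `javax.xml.xpath` package; it is not a port of Hpricot, and it assumes the input is well-formed markup, which real-world HTML often is not. `SpanXPathExtractor` and `evaluate` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes well-formed (XHTML-like) input.
final class SpanXPathExtractor {

    // Evaluates an XPath expression against the markup and returns the string
    // value of the first matching node (an empty string if nothing matches).
    static String evaluate(String markup, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(markup))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String markup = "<div><div><span id=\"important-data\">information here</span></div></div>";
        System.out.println(evaluate(markup, "//span[@id='important-data']"));
    }
}
```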
