java - How can I get details out of HTML?

Question

I have some Java code that prints the HTML of a chosen website. I want to print only specific dates from HTML like the following.

<tr class="bgWhite">
  <td align="center" width="50"><nobr>GD&#160;</nobr></td>
  <td align="center">Q3&#160;2012</td>

  <td align="left" width="*">Q3 2012 General Dynamics Earnings Release</td>
  <td align="center">$ 1.83&#160;</td>
  <td align="center">n/a&#160;</td>
  <td align="center">$ 1.83&#160;</td>
  <td align="center"><nobr>24-Oct-12</nobr></td>
</tr>
<tr class="bgWhite">
  <td align="center" width="50"><nobr>GD&#160;</nobr></td>
  <td align="center">Q2&#160;2012</td>

  <td align="left" width="*">Q2 2012 General Dynamics Earnings Release</td>
  <td align="center">$ 1.75&#160;</td>
  <td align="center">n/a&#160;</td>
  <td align="center">$ 1.79&#160;</td>
  <td align="center"><nobr>25-Jul-12 BMO</nobr></td>
</tr>

So I only want to print: 24-Oct-12 25-Jul-12

How can I do that?

This is the code I have:

String nextLine;
URL url = null;
URLConnection urlConn = null;
InputStreamReader  inStream = null;
BufferedReader buff = null;

try{
    // Create the URL object that points
    // at the page to fetch
    url  = new URL("http://www.earnings.com/company.asp?client=cb&ticker=gd");
    urlConn = url.openConnection();
    inStream = new InputStreamReader( 
                       urlConn.getInputStream());
    buff= new BufferedReader(inStream);

    // Read and print the page line by line
    while (true){
        nextLine =buff.readLine();  
        if (nextLine !=null){
            System.out.println(nextLine); 
        }
        else{
           break;
        } 
    }
 } catch(MalformedURLException e){
   System.out.println("Please check the URL:" + 
                                       e.toString() );
 } catch(IOException  e1){
  System.out.println("Can't read  from the Internet: "+ 
                                      e1.toString() ); 
}

score 3 · Accepted Answer

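It is easier to use a full-fledged HTML parser for this job than the low-level java.net.URLConnection. However, the target website produces completely non-semantic HTML (nothing but tables without semantic identifiers/classes, like an average 90's website (yuck)), so it is hard to parse properly even with a decent HTML parser. Anyway, here is a complete kickoff example using Jsoup that prints exactly the information you need.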

Document document = Jsoup.connect("http://www.earnings.com/company.asp?client=cb&ticker=gd").get();
Elements dateColumn = document.select("table:eq(0) tr:eq(0) table:eq(7) tr:eq(2) table:eq(4) td:eq(6):not(.dataHdrText02)");

for (Element dateCell : dateColumn) {
    System.out.println(dateCell.text());
}

That's all. There is no need to bother with the low-level java.net.URLConnection or a verbose SAX parser.

See also:

What are the pros and cons of the leading Java HTML parsers?

score 1 · Accepted Answer

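I think this is a standard use case for a SAX parser. You should not go line by line: you cannot expect the HTML document to always be laid out the way it is now, so a SAX parser is the more flexible solution.

If you have information about the document size and know that it will not grow large, you can also use a DOM parser. But from this point of view a SAX parser is still the better choice.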
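
To illustrate the SAX idea, here is a minimal sketch using only the JDK's built-in parser. It assumes the rows are well-formed XML, which holds for the fragment shown in the question but probably not for the whole live page, so real-world HTML would likely need tidying first. The class name DateCellExtractor, the SAMPLE constant, and the date pattern are my own assumptions for illustration, and nested tables are not handled.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
class DateCellExtractor {

    // Assumed date shape, e.g. 24-Oct-12; trailing text such as " BMO" is ignored.
    private static final Pattern DATE = Pattern.compile("\\d{1,2}-[A-Za-z]{3}-\\d{2}");

    // Shortened copy of the rows from the question (assumed to be well-formed).
    static final String SAMPLE =
          "<tr class=\"bgWhite\">"
        + "<td align=\"center\">Q3&#160;2012</td>"
        + "<td align=\"center\">$ 1.83&#160;</td>"
        + "<td align=\"center\"><nobr>24-Oct-12</nobr></td>"
        + "</tr>"
        + "<tr class=\"bgWhite\">"
        + "<td align=\"center\">Q2&#160;2012</td>"
        + "<td align=\"center\">$ 1.75&#160;</td>"
        + "<td align=\"center\"><nobr>25-Jul-12 BMO</nobr></td>"
        + "</tr>";

    static List<String> extractDates(String wellFormedRows) throws Exception {
        List<String> dates = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private StringBuilder cell; // non-null only while inside a <td>

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("td".equals(qName)) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it.
                if (cell != null) {
                    cell.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("td".equals(qName) && cell != null) {
                    Matcher m = DATE.matcher(cell);
                    if (m.find()) {
                        dates.add(m.group());
                    }
                    cell = null;
                }
            }
        };

        // Wrap the rows in a single root element so the input is one XML document.
        String xml = "<table>" + wellFormedRows + "</table>";
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return dates;
    }

    public static void main(String[] args) throws Exception {
        for (String date : extractDates(SAMPLE)) {
            System.out.println(date);
        }
    }
}
```

The handler collects the text of each td cell and keeps only the part that looks like a date, so it does not depend on how the page is split into lines. Matching on the text rather than on the column position is a design choice of this sketch; it would also pick up dates in other columns if the page had any.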
