java - How to read a website's source code in Java

Question

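I'm trying to write a web crawler in Java. So far it mostly works, but I'm having trouble with websites that retrieve their content dynamically using JavaScript or PHP. For example, when I try to crawl a tumblr blog, instead of getting the whole source code, with links and everything, I only get the CSS and header information. This is because all the post information is gathered by JavaScript.

The code I'm using to get the source code from a web page is ...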

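import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public static String openURL( String url )
{
    String source = null;
    try
    {
        URL my_url = new URL(url);

        HttpURLConnection urlConnection = (HttpURLConnection) my_url.openConnection();
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0");

        // try-with-resources closes the stream even if reading fails
        try (InputStream bis = new BufferedInputStream(urlConnection.getInputStream()))
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int bytesread;

            while( (bytesread = bis.read(buffer)) != -1 )
            {
                bytes.write(buffer, 0, bytesread);
            }
            // Decode once at the end so multi-byte characters are never split
            // across buffer boundaries (assumes the page is UTF-8 encoded)
            source = new String(bytes.toByteArray(), StandardCharsets.UTF_8);
        }
    }
    catch (IOException ex)
    {
        // Report the failure instead of silently swallowing it
        ex.printStackTrace();
    }
    System.out.println(source);
    return source;
}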

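How can I change this to retrieve the dynamic content? Any help is much appreciated.

Cheers, Daniel

Edit: Your answers are helpful, but sorry, this project is more of an educational one, so I was trying to find a way to do it without using third-party APIs.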

score 2 · Accepted Answer

Generally, web crawlers see websites without the JavaScript having been processed. Web developers know this, so "good" websites can be read successfully without JS.
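To make that concrete, here is a rough sketch of what a crawler can pull out of the raw, unprocessed HTML using only the standard library: the links that are already in the markup, plus the script files the page would load to build the rest. The class name `StaticHtmlView`, its method names, and the sample HTML are made up for illustration, and the regular expressions are a simplification, not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: shows what is visible in HTML before any JavaScript runs.
class StaticHtmlView {
    // Rough patterns for href="..." in <a> tags and src="..." in <script> tags.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\s[^>]*?src\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Collects the first capture group of every match, in document order.
    private static List<String> findAll(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    // Links present in the static markup.
    static List<String> links(String html) {
        return findAll(LINK, html);
    }

    // External scripts the page would load; content they generate is NOT in the HTML.
    static List<String> scriptSources(String html) {
        return findAll(SCRIPT_SRC, html);
    }

    public static void main(String[] args) {
        // Made-up page: the posts container is empty until posts.js fills it in.
        String html = "<html><head><script src=\"/js/posts.js\"></script></head>"
                + "<body><a href=\"/about\">About</a><div id=\"posts\"></div></body></html>";
        System.out.println("links: " + links(html));           // links: [/about]
        System.out.println("scripts: " + scriptSources(html)); // scripts: [/js/posts.js]
    }
}
```

In this made-up page the `posts` container is empty in the source, which is the situation described in the question: the crawler gets the static links and the script references, but nothing the script would have added.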

If you really, really want to process the JS (although your life is going to be a lot easier if you don't), you can use this tool: http://phantomjs.org/

I haven't actually used it, but it allows you to process the JS without using a browser.

score 0 · Accepted Answer

If you want to do it in Java, take a look at HtmlUnit, which can process JavaScript, or Selenium, which lets you drive a real browser.
