java - Problem downloading a web page in Java?

Question

I'm trying to download the text of an aspx web page (Roblox) in Java. My code looks like this:

URL url;
InputStream is = null;
DataInputStream dis;
String line = "";
try {
    System.out.println("connecting");
    url = new URL("http://www.roblox.com");
    is = url.openStream();  // throws an IOException
    dis = new DataInputStream(new BufferedInputStream(is));

    while ((line = dis.readLine()) != null) {
        System.out.println(line);
    }
} catch (Exception ex) {
    ex.printStackTrace();
} finally {
    if (is != null) {  // openStream() may have failed before is was assigned
        try {
            is.close();
        } catch (IOException ioe) {}
    }
}

This works for www.roblox.com. However, when I try to go to a different page (http://www.roblox.com/My/Money.aspx#/#TradeCurrency_tab), it doesn't work and just loads the www.roblox.com page.

Can anyone help me clear this up? Any help would be appreciated.

score 0 · Accepted Answer

You are getting different content in Java than what you see in the browser because the server adds the following header to the response:

Location=https://www.roblox.com/Login/Default.aspx?ReturnUrl=%2fMy%2fMoney.aspx

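You need to get the header values from the URLConnection and redirect manually if the "Location" header is present. To my knowledge, even if you use HttpURLConnection, it will not automatically redirect you to "https".

Edit:

You can do it with something like this (I removed other code such as exception handling just to focus on the redirect, so don't take it as a proper "coding" example):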

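import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static void main(String[] args) throws Exception {
    printPage("http://www.roblox.com/My/Money.aspx#/#TradeCurrency_tab");
}

public static void printPage(String address) throws Exception {
    System.out.println("connecting to:" + address);
    URL url = new URL(address);
    URLConnection conn = url.openConnection();
    // A "Location" header means the server wants us to go somewhere else.
    String redirectAddress = conn.getHeaderField("Location");
    if (redirectAddress != null) {
        // Follow the redirect manually (this also covers http -> https).
        printPage(redirectAddress);
    } else {
        // Read from the connection we already opened instead of opening a second one.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}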
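
If you want something a bit more defensive, here is a rough sketch of the same manual-redirect idea. It is only an illustration, not tested against Roblox: the class name RedirectSketch, the helper names and the MAX_REDIRECTS limit are my own assumptions. It caps the number of hops so a redirect loop cannot run forever, and it resolves the "Location" value against the current address in case the server sends a relative path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

class RedirectSketch {
    // Hypothetical cap so that a redirect loop cannot go on forever.
    static final int MAX_REDIRECTS = 5;

    // True for the HTTP status codes that ask the client to follow a Location header.
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    // Resolves a Location value against the address that sent it, so that
    // relative values such as "/Login/Default.aspx" also work.
    static String resolveLocation(String base, String location) {
        try {
            return new URL(new URL(base), location).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad redirect target: " + location, e);
        }
    }

    // Prints the page at the address, following at most MAX_REDIRECTS redirects by hand.
    static void printPage(String address) throws IOException {
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
            conn.setInstanceFollowRedirects(false); // every redirect is handled below
            String location = conn.getHeaderField("Location");
            if (isRedirect(conn.getResponseCode()) && location != null) {
                address = resolveLocation(address, location);
                conn.disconnect();
                continue;
            }
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            return;
        }
        throw new IOException("too many redirects, last address: " + address);
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            printPage(args[0]); // needs network access
        } else {
            // Offline demo: resolve a relative Location value.
            System.out.println(resolveLocation(
                    "http://www.roblox.com/My/Money.aspx",
                    "/Login/Default.aspx?ReturnUrl=%2fMy%2fMoney.aspx"));
        }
    }
}
```

Keep in mind that for this particular page the redirect leads to a login form, so following it will print the login page rather than the Money page unless the request also carries a valid session cookie.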

score 0 · Accepted Answer

Judging from the URL and the use of #, I suspect this page uses JavaScript to build the page dynamically.
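
To see why the # matters, here is a small sketch (the class and method names are made up for illustration). The part of a URL after the first # is the fragment, which the browser keeps for itself and does not send in the HTTP request, so the server only ever sees the path before it. java.net.URL exposes the two pieces separately:

```java
import java.net.MalformedURLException;
import java.net.URL;

class FragmentSketch {
    // Hypothetical helper: parses the address or fails with an unchecked exception.
    static URL parse(String address) {
        try {
            return new URL(address);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad address: " + address, e);
        }
    }

    // The path is the part a plain HTTP client actually requests.
    static String pathOf(String address) {
        return parse(address).getPath();
    }

    // The fragment (everything after the first '#') stays on the client side.
    static String fragmentOf(String address) {
        return parse(address).getRef();
    }

    public static void main(String[] args) {
        String address = "http://www.roblox.com/My/Money.aspx#/#TradeCurrency_tab";
        System.out.println("path:     " + pathOf(address));
        System.out.println("fragment: " + fragmentOf(address));
    }
}
```

So a plain download can at best return the base Money.aspx page. Whatever the TradeCurrency tab shows would have to be produced by the page's own script, which presumably reads the fragment in the browser.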

You can use something like http://seleniumhq.org/ to emulate a web browser (including cookies). This is a much more reliable approach for any kind of dynamic web content.

    // The Firefox driver supports javascript 
    WebDriver driver = new FirefoxDriver();

    // Go to the roblox page
    driver.get("http://www.roblox.com");

    System.out.println(driver.getPageSource());

Of course, there are many better ways to access elements of the page through Selenium's WebDriver API: http://selenium.googlecode.com/svn/trunk/docs/api/java/org/openqa/selenium/WebDriver.html

Download the JAR with all deps in one file: http://code.google.com/p/selenium/downloads/detail?name=selenium-server-standalone-2.27.0.jar

Also note that you can navigate to other pages from code (see http://seleniumhq.org/docs/03_webdriver.html):

     WebElement link = driver.findElement(By.linkText("Click Here Or Whatever"));
     link.click();

Then

     System.out.println(driver.getPageSource());

gets you the page text of the next page.
