java - How can I use Jsoup to get all the information present in each link?

Question

    package com.muthu;

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.helper.Validate;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class TestingTool
    {
        public static void main(String[] args) throws IOException
        {
            Validate.isTrue(args.length == 0, "usage: supply url to fetch");
            String url = "http://www.stackoverflow.com/";
            print("Fetching %s...", url);
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]");
            System.out.println(doc.text());
            Elements tags = doc.getElementsByTag("div");
            String alls = doc.text();
            System.out.println("\n");
            // Print the absolute URL and the (shortened) text of every link
            for (Element link : links)
            {
                print("  %s  (%s)", link.attr("abs:href"), trim(link.text(), 35));
            }
            // Write the text of every link to a file, one link per line
            try (BufferedWriter bw = new BufferedWriter(new FileWriter(new File("C:/tool/linknames.txt"))))
            {
                for (Element link : links)
                {
                    bw.write("Link: " + link.text().trim());
                    bw.newLine();
                }
            }
        }

        private static void print(String msg, Object... args) {
            System.out.println(String.format(msg, args));
        }

        // Shorten s to at most width characters, ending with "." if it was cut
        private static String trim(String s, int width) {
            if (s.length() > width)
                return s.substring(0, width - 1) + ".";
            else
                return s;
        }
    }

score 3 · Accepted Answer

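When you connect to a URL, only the current page is parsed. However, you can 1.) connect to a URL, 2.) parse the information you need, 3.) select all further links, 4.) connect to them, and 5.) continue this as long as there are new links.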

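Considerations:

You need a List (?) or some other collection in which you store the links you have already parsed
You have to decide whether you need only links of this page or external ones too
You have to skip some pages like "about", "contact" etc.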
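The first two considerations can be sketched without any Jsoup calls. This is only a rough illustration: `LinkFilter` and `shouldVisit` are hypothetical names, not part of Jsoup, and the sketch assumes that "links of this page" means links whose host matches the host of the start URL. It uses a `HashSet` rather than a `List` so that the "already parsed?" check stays cheap.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Jsoup: decides whether a link should be crawled.
final class LinkFilter {
    private final String startHost;                      // host of the page the crawl started from
    private final Set<String> visited = new HashSet<>(); // URLs already handed out for crawling

    LinkFilter(String startUrl) {
        // URI.create throws an unchecked IllegalArgumentException for a malformed start URL
        this.startHost = hostOf(URI.create(startUrl));
    }

    // Returns true only the first time an internal (same-host) URL is seen.
    boolean shouldVisit(String url) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false; // malformed link: skip it
        }
        String host = hostOf(uri);
        if (host == null || !host.equals(startHost)) {
            return false; // external link, or a link without a host (e.g. mailto:)
        }
        // Drop the #fragment so "page" and "page#top" count as the same page
        String key = url.split("#", 2)[0];
        return visited.add(key); // Set.add returns false if the key was already present
    }

    private static String hostOf(URI uri) {
        String host = uri.getHost();
        return host == null ? null : host.toLowerCase(Locale.ROOT); // host names are case-insensitive
    }
}
```

In a crawler, each absolute link taken from `a[href]` would be passed through `shouldVisit` before connecting to it. If external links are wanted as well, the host comparison could simply be dropped.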

Edit:
(Note: you have to add some changes and error-handling code.)

List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited


public void visitUrl(String url) throws IOException
{
    String key = url.toLowerCase(); // Compare case-insensitively, but fetch the original URL

    if( !visitedUrls.contains(key) ) // Do this only if not visited yet
    {
        visitedUrls.add(key); // Mark as visited first, otherwise the recursion never ends

        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the Document

        /* ... Select your data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select next links - add more restrictions!

        for( Element next : nextLinks ) // Iterate over all links
        {
            visitUrl(next.absUrl("href")); // Recursive call for all next links
        }
    }
}

You have to add more restrictions / checks to the part where the next links are selected (you may want to skip or ignore some of them), as well as some error handling.

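Edit 2:

To skip ignored links, you can use this:

1.) Create a Set / List / whatever in which you store the ignored keywords
2.) Fill it with those keywords
3.) Before you call the visitUrl() method with the new link to parse, check whether this new URL contains any of the ignored keywords. If it contains at least one, it is skipped.

I modified the example a bit to do so (but it's not tested yet!).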

List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited
Set<String> ignore = new HashSet<>(); // Store all keywords you want to ignore

// ...


/*
 * Add keywords to the ignore list. Each link that contains one of these
 * words will be skipped.
 *
 * Do this in e.g. a constructor, a static block or an init method.
 */
ignore.add(".twitter.com");

// ...


public void visitUrl(String url) throws IOException
{
    String key = url.toLowerCase(); // Compare case-insensitively, but fetch the original URL

    if( !visitedUrls.contains(key) ) // Do this only if not visited yet
    {
        visitedUrls.add(key); // Mark as visited first, otherwise the recursion never ends

        Document doc = Jsoup.connect(url).get(); // Connect to the URL and parse the Document

        /* ... Select your data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select next links - add more restrictions!

        for( Element next : nextLinks ) // Iterate over all links
        {
            boolean skip = false; // If false: parse the URL, if true: skip it
            final String href = next.absUrl("href"); // Select the 'href' attribute -> next link to parse

            for( String s : ignore ) // Iterate over all ignored keywords - maybe there's a better solution for this
            {
                if( href.contains(s) ) // If the URL contains an ignored keyword it will be skipped
                {
                    skip = true;
                    break;
                }
            }

            if( !skip )
                visitUrl(next.absUrl("href")); // Recursive call for all next links
        }
    }
}
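The comment on the inner loop asks whether there is a better solution. One possible alternative, assuming Java 8 or later, is `Stream.anyMatch`. The sketch below is not part of the original answer: `IgnoreList` and `isIgnored` are hypothetical names, and unlike the loop above it lowercases the URL before comparing, so it assumes the keywords are stored in lower case.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Jsoup: holds the ignored keywords.
final class IgnoreList {
    private final Set<String> keywords; // assumed to be lower case

    IgnoreList(String... keywords) {
        this.keywords = new HashSet<>(Arrays.asList(keywords));
    }

    // True if the URL contains at least one ignored keyword (case-insensitive on the URL side).
    boolean isIgnored(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return keywords.stream().anyMatch(lower::contains);
    }
}
```

With something like `new IgnoreList(".twitter.com", "/contact")`, the `skip` flag and the inner loop would reduce to a single `if( !ignoreList.isIgnored(href) )` check before the recursive call.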

Parsing of the next link is done by this:

final String href = next.absUrl("href");
/* ... */
visitUrl(next.absUrl("href"));

But you may need to add some more stop conditions to this part.
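As a rough idea of what such stop conditions could look like, here is a sketch with a maximum link depth and a maximum number of pages. It deliberately swaps the recursion for a queue (breadth-first), which is a different technique from the code above but avoids very deep call stacks on large sites. Everything named here is an assumption: `BoundedCrawler`, `LinkSource`, `maxDepth` and `maxPages` are made up for illustration. `LinkSource` stands in for the Jsoup part, for example a wrapper around `Jsoup.connect(url).get()` and `doc.select("a[href]")` that returns each link's `absUrl("href")`.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a crawl loop with two stop conditions.
final class BoundedCrawler {
    // Stand-in for the page fetch: returns the absolute links found on a page.
    interface LinkSource {
        List<String> linksOf(String url) throws IOException;
    }

    private final LinkSource source;
    private final int maxDepth; // how many links away from the start page we follow
    private final int maxPages; // upper bound on the number of pages attempted

    BoundedCrawler(LinkSource source, int maxDepth, int maxPages) {
        this.source = source;
        this.maxDepth = maxDepth;
        this.maxPages = maxPages;
    }

    // Returns the URLs that were attempted, in the order they were visited.
    Set<String> crawl(String startUrl) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> urls = new ArrayDeque<>();     // pages waiting to be fetched
        Deque<Integer> depths = new ArrayDeque<>();  // depth of each waiting page, kept in step with urls
        urls.add(startUrl);
        depths.add(0);

        while (!urls.isEmpty() && visited.size() < maxPages) {
            String url = urls.poll();
            int depth = depths.poll();
            if (!visited.add(url)) {
                continue; // already attempted via another link
            }
            List<String> links;
            try {
                links = source.linksOf(url);
            } catch (IOException e) {
                continue; // unreachable page: skip it; it stays in visited so it is not retried
            }
            if (depth < maxDepth) { // stop condition: do not follow links beyond maxDepth
                for (String link : links) {
                    if (!visited.contains(link)) {
                        urls.add(link);
                        depths.add(depth + 1);
                    }
                }
            }
        }
        return visited;
    }
}
```

The host check and the ignored keywords would fit naturally inside the `for` loop, right before a link is added to the queue.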
