java - String manipulation for a URL harvester

Question

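I am doing recursive URL harvesting. When I find a link in the page source that does not start with "http", I append it to the current URL. The problem is that on dynamic sites, a link without "http" is usually just a new parameter for the current URL. For example, suppose the current URL is something like http://www.somewebapp.com/default.aspx?pageid=4088 and the source of that page contains the link default.aspx?pageid=2111. In this case I need to do some string manipulation, and this is where I need help.
Pseudocode: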

if part of the link found contains a substring of the current url
      save the substring            
      save the unique part of the link found
replace whatever is after the substring in the current url with the unique saved part

What would this look like in Java? Any ideas for doing this a different way? Thanks.

As requested in the comments, here is what I tried:

if (!matched.startsWith("http")) {
    String[] splitted = url.toString().split("/");
    java.lang.String endOfURL = splitted[splitted.length-1];
    boolean b = false;
    while (!b && endOfURL.length() > 5) { // f.bar shortest val
        endOfURL = endOfURL.substring(0, endOfURL.length()-2);
        if (matched.contains(endOfURL)) {
            matched = matched.substring(endOfURL.length()-1);
            matched = url.toString().substring(url.toString().length() - matched.length()) + matched;
            b = true;
        }
    }
}

It doesn't work.

score 1 · Accepted Answer

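I think you are going about this the wrong way. Java has two classes, URL and URI, that can parse URL/URI strings much more accurately than a "string bashing" solution. For example, the URL constructor URL(URL, String) creates a new URL object in the context of an existing one, so you don't need to worry about whether the string is an absolute or a relative URL. You would use it something like this: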

URL currentPageUrl = ...
String linkUrlString = ...

// (Exception handling not included ...)
URL linkUrl = new URL(currentPageUrl, linkUrlString);
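
To make the snippet above concrete, here is a minimal runnable sketch using the example URLs from the question. The class name LinkResolver and the helper methods resolveWithUrl and resolveWithUri are hypothetical names chosen for illustration. The second helper uses URI.resolve, which is not what the snippet above shows but should give the same result for this case; it may be worth considering because newer JDKs mark the URL constructors as deprecated.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class LinkResolver {

    // Resolve a link against the page it was found on, using URL(URL, String).
    // Absolute links are returned as-is; relative links are resolved against the page.
    static String resolveWithUrl(String currentPageUrl, String link) {
        try {
            URL base = new URL(currentPageUrl);
            return new URL(base, link).toString();
        } catch (MalformedURLException e) {
            // Hypothetical error handling: a real harvester might skip the link instead.
            throw new IllegalArgumentException("Bad URL: " + link, e);
        }
    }

    // Alternative (an assumption, not part of the snippet above): URI.resolve.
    static String resolveWithUri(String currentPageUrl, String link) {
        return URI.create(currentPageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.somewebapp.com/default.aspx?pageid=4088";

        // Relative link: replaces the last path segment and the query string.
        System.out.println(resolveWithUrl(page, "default.aspx?pageid=2111"));
        // prints http://www.somewebapp.com/default.aspx?pageid=2111

        // Absolute link: the base URL is ignored.
        System.out.println(resolveWithUrl(page, "http://example.org/other.html"));
        // prints http://example.org/other.html

        System.out.println(resolveWithUri(page, "default.aspx?pageid=2111"));
        // prints http://www.somewebapp.com/default.aspx?pageid=2111
    }
}
```

With this approach the startsWith("http") check in the question is no longer needed, since both absolute and relative links go through the same call.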
