python - URL parsing in Python - normalizing double slashes in paths

Question

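I am working on an app that needs to parse URLs (mostly HTTP URLs) in HTML pages. I have no control over the input and, as expected, some of it is a bit messy.

One problem I run into frequently is that urlparse is very strict (and possibly even buggy?) when it comes to parsing and joining URLs that have double slashes in the path part. For example: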

testUrl = 'http://www.example.com//path?foo=bar'
urlparse.urljoin(testUrl, 
                 urlparse.urlparse(testUrl).path)

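Instead of the expected result http://www.example.com//path (or, even better, the same URL with a normalized single slash), I end up with http://path.

By the way, the reason I run such code is that it is the only way I have found to strip the query/fragment part from a URL. Maybe there is a better way to do it, but I could not find one.

Can anyone recommend a way to avoid this, or should I just normalize the path myself using a (relatively simple, I know) regex?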

score 5 · Accepted Answer

If you only want the URL without the query part, I would skip the urlparse module and just do:

testUrl.rsplit('?')

The URL will be at index 0 of the returned list and the query at index 1.

It is not possible to have two '?' and it should work for all URLs.
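Two caveats apply to splitting on '?': RFC 3986 does allow a literal '?' inside the query, and a fragment on a URL that has no query stays attached to the result. As a hedged alternative for Python 3 (where the urlparse module lives in urllib.parse), a minimal sketch that drops both the query and the fragment could look like this. The helper name strip_query_and_fragment is made up for illustration and is not part of the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query_and_fragment(url):
    """Return url without its query string and fragment (hypothetical helper)."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    # Rebuild the URL from the first three components only; the path is untouched,
    # so a double slash in the path survives as-is.
    return urlunsplit((scheme, netloc, path, "", ""))


print(strip_query_and_fragment("http://www.example.com//path?foo=bar#top"))
# http://www.example.com//path
```

Because the path is never passed through urljoin on its own, the leading "//" is not mistaken for a host name here.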

score 5 · Accepted Answer

The path alone (//path) is not valid, which confuses the function, so it gets interpreted as a hostname.

https://www.rfc-editor.org/rfc/rfc3986.html#section-3.3

If a URI does not contain an authority component, then the path cannot begin with two slash characters ("//").

I do not particularly like either of these solutions, but they work.
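The effect of that rule can be seen directly in the parser. The following is a small Python 3 sketch (using urllib.parse, the Python 3 home of urlparse) that shows how a reference starting with "//" is read as an authority rather than a path:

```python
from urllib.parse import urljoin, urlsplit

base = "http://www.example.com//path?foo=bar"

# A reference that starts with "//" is a network-path reference:
# what follows the slashes is parsed as the authority (host), not the path.
ref = urlsplit("//path")
print(repr(ref.netloc), repr(ref.path))
# 'path' ''

# Joining it against the base therefore keeps only the scheme from the base.
print(urljoin(base, "//path"))
# http://path
```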

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'

parsed = list(urlparse.urlparse(testurl))
parsed[2] = re.sub("/{2,}", "/", parsed[2]) # replace two or more / with one
cleaned = urlparse.urlunparse(parsed)

print cleaned
# http://www.example.com/path?foo=bar

print urlparse.urljoin(
    testurl, 
    urlparse.urlparse(cleaned).path)

# http://www.example.com/path

Depending on what you are doing, you could do the joining manually:

import re
import urlparse

testurl = 'http://www.example.com//path?foo=bar'
parsed = list(urlparse.urlparse(testurl))

newurl = ["" for i in range(6)] # could urlparse another address instead

# Copy first 3 values from
# ['http', 'www.example.com', '//path', '', 'foo=bar', '']
for i in range(3):
    newurl[i] = parsed[i]
    
# Rest are blank (params, query, fragment)
for i in range(3, 6):
    newurl[i] = ''

print urlparse.urlunparse(newurl)
# http://www.example.com//path
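The snippets above are written for Python 2. For Python 3, a rough equivalent of the first approach might look like the sketch below. The function name collapse_path_slashes is hypothetical, and the sketch assumes that collapsing repeated slashes does not change what the server returns, which is not guaranteed for every site.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_path_slashes(url):
    """Collapse runs of '/' in the path only (hypothetical helper)."""
    parts = urlsplit(url)
    # urlsplit returns a namedtuple subclass, so _replace swaps one field.
    cleaned = parts._replace(path=re.sub(r"/{2,}", "/", parts.path))
    return urlunsplit(cleaned)


print(collapse_path_slashes("http://www.example.com//path?foo=bar"))
# http://www.example.com/path?foo=bar
```

Only the path component is rewritten, so the "//" after the scheme and any slashes inside the query string are left alone.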

score 2 · Accepted Answer

Try this:

def http_normalize_slashes(url):
    url = str(url)
    segments = url.split('/')
    correct_segments = []
    for segment in segments:
        if segment != '':
            correct_segments.append(segment)
    first_segment = str(correct_segments[0])
    if first_segment.find('http') == -1:
        correct_segments = ['http:'] + correct_segments
    correct_segments[0] = correct_segments[0] + '/'
    normalized_url = '/'.join(correct_segments)
    return normalized_url

Example URLs:

print(http_normalize_slashes('http://www.example.com//path?foo=bar'))
print(http_normalize_slashes('http:/www.example.com//path?foo=bar'))
print(http_normalize_slashes('www.example.com//x///c//v///path?foo=bar'))
print(http_normalize_slashes('http://////www.example.com//x///c//v///path?foo=bar'))

Returns:

http://www.example.com/path?foo=bar
http://www.example.com/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar
http://www.example.com/x/c/v/path?foo=bar

Hope that helps. :)

score 2 · Accepted Answer

The official urlparse documentation states the following:

If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. For example:

>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html',
...         '//www.python.org/%7Eguido')
'http://www.python.org/%7Eguido'

If you do not want that behavior, preprocess the url with urlsplit() and urlunsplit(), removing possible scheme and netloc parts.

So you can do:

urlparse.urljoin(testUrl,
             urlparse.urlparse(testUrl).path.replace('//','/'))

Output = 'http://www.example.com/path'
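One thing to keep in mind with replace('//', '/'): str.replace makes a single left-to-right pass, so a run of three or more slashes is not fully collapsed. A short sketch comparing it with a regex, using an illustrative path that is not from the question:

```python
import re

path = "/x///c//v////path"

# str.replace scans once, left to right, so longer runs leave a "//" behind.
print(path.replace("//", "/"))
# /x//c/v//path

# A regex that matches any run of two or more slashes collapses every run.
print(re.sub(r"/{2,}", "/", path))
# /x/c/v/path
```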

score 0 · Accepted Answer


Wouldn't this solve it?

urlparse.urlparse(testUrl).path.replace('//', '/')

answered 2012-01-19T12:54:38.293
