python - How to extract the top-level domain name (TLD) from a URL

Question

How would you extract the domain name from a URL, excluding any subdomains?

My first naive attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without relying on special knowledge about valid TLDs (top-level domains) or country codes (since they change)?
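To make the failure concrete, here is a minimal Python 3 sketch of the same one-liner (using urllib.parse in place of the Python 2 urlparse module; the function name naive_domain is only for illustration):

```python
from urllib.parse import urlparse

def naive_domain(url):
    # Keep only the last two dot-separated labels of the host name.
    return ".".join(urlparse(url).netloc.split(".")[-2:])

print(naive_domain("http://www.foo.com"))     # foo.com
print(naive_domain("http://www.foo.com.au"))  # com.au (wanted: foo.com.au)
```

The second call drops "foo" because the code has no way of knowing that com.au is a two-label suffix.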

Thanks

score 58 · Accepted Answer

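Here is a great Python module that someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, which is maintained by Mozilla volunteers.

Quote:

tldextract, on the other hand, knows what all gTLDs [generic top-level domains] and ccTLDs [country code top-level domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.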

score 52 · Accepted Answer

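No, there is no "intrinsic" way of knowing that (for example) zap.co.it is a subdomain (because Italy's registrar does sell domains such as co.it) while zap.co.uk is not (because the UK's registrar does not sell domains such as co.uk, only ones like zap.co.uk).

You will have to use an auxiliary table (or an online source) that tells you which TLDs behave peculiarly, like the UK's and Australia's. There is no way of divining that just by staring at the string without such extra semantic knowledge. Of course it can change eventually, but if you can find a good online source, that source should change accordingly.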
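As a rough illustration of what such an auxiliary table buys you, here is a minimal sketch. The table SUFFIXES is a tiny hand-written sample that simply encodes the claims above (it is an assumption for illustration, not the real registry data), and registrable_domain is a hypothetical helper name:

```python
# Hand-written sample table (an assumption for illustration only).
# "co.it" is deliberately absent, following the claim that it is sold
# as an ordinary domain under "it".
SUFFIXES = {"com", "uk", "co.uk", "it", "au", "com.au"}

def registrable_domain(hostname, suffixes=SUFFIXES):
    """Return the longest known suffix plus one more label, or None."""
    labels = hostname.lower().split(".")
    # Try the longest candidate suffix first, skipping the full host name.
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in suffixes:
            return ".".join(labels[i - 1:])
    return None

print(registrable_domain("www.zap.co.uk"))   # zap.co.uk
print(registrable_domain("www.zap.co.it"))   # co.it
print(registrable_domain("www.foo.com.au"))  # foo.com.au
```

The two "co" host names have the same shape, and only the table makes them come out differently.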

score 43 · Accepted Answer

Using this file of effective TLDs, which someone else found on Mozilla's website:

from urllib.parse import urlparse  # Python 3; the original used the Python 2 urlparse module

# Load the TLD rules, ignoring comments (lines starting with "/") and empty lines.
# A set makes the membership tests below fast.
with open("effective_tld_names.dat.txt", encoding="utf-8") as tld_file:
    tlds = {line.strip() for line in tld_file if line[0] not in "/\n"}

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # Match against the rules. An exception rule ("!name") means the
        # candidate itself is the registered domain.
        if exception_candidate in tlds:
            return ".".join(url_elements[i:])
        if candidate in tlds or wildcard_candidate in tlds:
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print(get_domain("http://abcde.co.uk", tlds))

Result:

abcde.co.uk

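I'd appreciate it if someone could tell me which bits of the above could be rewritten in a more Pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know whether ValueError is the best thing to raise. Comments?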
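On the request for a tidier version, one possible Python 3 sketch follows. It is not a reviewed replacement for the code above: SAMPLE_RULES is a small inline sample written in the list's rule syntax (an assumption so the snippet runs without the data file), the name registrable_domain is hypothetical, and it returns None rather than raising when nothing matches.

```python
# A few sample rules in Public Suffix List syntax (illustrative subset):
# plain suffixes, a wildcard rule ("*.ck") and an exception rule ("!www.ck").
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}

def registrable_domain(hostname, rules=SAMPLE_RULES):
    """Return the registrable domain of hostname, or None if unknown."""
    labels = hostname.lower().rstrip(".").split(".")
    # Walk from the longest candidate to the shortest, so the first hit
    # is the rule with the most labels.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the candidate itself is registrable.
            return candidate
        if candidate in rules or wildcard in rules:
            if i == 0:
                return None  # the host name is itself a public suffix
            return ".".join(labels[i - 1:])
    return None

print(registrable_domain("www.abcde.co.uk"))  # abcde.co.uk
print(registrable_domain("foo.bar.ck"))       # foo.bar.ck
print(registrable_domain("sub.www.ck"))       # www.ck
```

Iterating with a non-negative index and slicing once per step avoids the negative-range bookkeeping. Whether to return None or raise ValueError is a design choice: an exception suits callers that treat an unknown suffix as an error.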

score 36 · Accepted Answer

Using the Python tld package

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL

from tld import get_tld
print(get_tld("http://www.google.co.uk"))

co.uk

Or without the protocol

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

score 2 · Accepted Answer

There are many, many TLDs. Here is the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here is another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here is another list:

http://www.iana.org/domains/root/db/

score 0 · Accepted Answer

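Until get_tld is updated for all the new ones, I pull the TLD out of the error message. Sure, it's bad code, but it works.

import re
import tld

def get_tld(self):
    # Intended as a method of a class that has a content_url attribute.
    # Calling tld.get_tld explicitly avoids recursing into this method.
    try:
        return tld.get_tld(self.content_url)
    except Exception as e:
        re_domain = re.compile("Domain ([^ ]+) didn't match any existing TLD name!")
        matches = re_domain.findall(str(e))
        if matches:
            return matches[0]
        raise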

score -1 · Accepted Answer

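Here is how I handle it:

import re
import sys
from urllib.parse import urlparse  # Python 3; the original used the Python 2 urlparse module

# url is assumed to have been set earlier, e.g. from user input
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse(url)[1]
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match or not match.group(0):
    sys.exit(2)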

score -1 · Accepted Answer

In Python I used to use tldextract, until it failed on a URL like www.mybrand.sa.com, parsing it as subdomain='order.mybrand', domain='sa', suffix='com'.

So in the end I decided to write this method.

IMPORTANT NOTE: this only works with URLs that contain a subdomain. It is not meant to replace more advanced libraries like tldextract.

def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
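A short usage sketch may help show what this method does and where it stops working. The function is repeated so the snippet runs on its own, and the host names are made-up examples:

```python
def urlextract(url):
    url_split = url.split(".")
    if len(url_split) <= 2:
        raise Exception("Full url required with subdomain:", url)
    return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}

# Exactly one subdomain label: the split lands where the author wants it.
print(urlextract("www.mybrand.sa.com"))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}

# Two subdomain labels: the second one is mistaken for the domain.
print(urlextract("a.b.example.com"))
# {'subdomain': 'a', 'domain': 'b', 'suffix': 'example.com'}
```

Note that the input has to be a bare host name. A string with a scheme such as "http://www.mybrand.sa.com" would put "http://www" into the subdomain field.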
