python - Split a string into individual words in Python

Question

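I have a large list of domain names (around 6,000), and I would like to see which words occur most often, as a rough overview of our portfolio.

The problem I have is that the list is formatted as domain names, for example: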

examplecartrading.com

examplepensions.co.uk

exampledeals.org

examplesummeroffers.com

+5996

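Just running a word count produces garbage. So I guess the simplest way to go about this would be to insert spaces between the whole words and then run a word count.

For my own sanity I would prefer to script this.

I know (very) little Python 2.7, but I am open to any recommendations on how to approach this, and example code would really help. I have been told that a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement one in Python.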

score 6 · Accepted Answer

We try to split the domain name (s) into any number of words (not just 2) from a set of known words (words). Recursion ftw!

def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

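This iterator function first yields the string it was called with, if that string is in words. Then it splits the string in two in every possible way. If the first part is not in words, it tries the next split. If it is, the first part is prepended to every result of calling itself on the second part (which may be none, as in ["example", "cart", ...]).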
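
As a small illustration of that behaviour, the sketch below repeats the generator so it runs on its own and feeds it a made-up miniature dictionary. The name toy_words and its contents are assumptions for the example, not part of the answer.

```python
def substrings_in_set(s, words):
    # Same generator as in the answer, repeated so the snippet runs on its own
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

# Hypothetical miniature dictionary, just for illustration
toy_words = {"example", "car", "cart", "trading", "pens", "ions", "pensions"}

# "cart" is a word, but "rading" is not, so that branch yields nothing
splits_a = list(substrings_in_set("examplecartrading", toy_words))
# "pensions" matches both as a whole and as "pens" + "ions"
splits_b = list(substrings_in_set("examplepensions", toy_words))

print(splits_a)  # [['example', 'car', 'trading']]
print(splits_b)  # [['example', 'pensions'], ['example', 'pens', 'ions']]
```

The second result shows why a single name can produce several valid splits, which matters for the counting step later on.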

Next, we build the English dictionary:

# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

# The above English dictionary for some reason lists all single letters as words.
# Remove all except "a", "i" and "u" (remember a string is an iterable, which
# means that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))
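
Since the comments above mention two possible locations for the word list, a loader that tries each in turn may be convenient. This is only a sketch: load_words and WORD_LIST_PATHS are hypothetical names introduced here, and the paths are the two named in the comments, which may not exist on every system.

```python
import os

def load_words(candidate_paths):
    """Return a lower-cased set of words from the first path that exists."""
    for path in candidate_paths:
        if os.path.exists(path):
            with open(path) as f:
                # Skip blank lines so the empty string never counts as a word
                return set(line.strip().lower() for line in f if line.strip())
    raise IOError("no word list found in: %r" % (candidate_paths,))

# The two locations mentioned in the comments above; other systems may differ.
WORD_LIST_PATHS = ["/usr/share/dict/words", "/usr/dict/words"]
```

With this in place, words = load_words(WORD_LIST_PATHS) could stand in for the one-line set construction, and the -= and |= adjustments would follow unchanged.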

Now we can put things together:

count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk", 
    "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match

Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}

Using a set to hold the English dictionary makes membership checks fast. -= removes items from the set, and |= adds to it.

Using the all function together with a generator expression improves efficiency, since all returns at the first False.
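
The code shown in this answer does not call all itself, so here is a standalone sketch of the short-circuiting being described. The helper is_word, the list checked, and the sample data are all made up for the illustration.

```python
checked = []

def is_word(part, words):
    # Record every membership test so the short-circuiting becomes visible
    checked.append(part)
    return part in words

toy_words = {"example", "deals"}      # hypothetical miniature dictionary
parts = ["example", "xyz", "deals"]   # "xyz" is not a word

# The generator expression is consumed lazily, so all() stops at "xyz"
result = all(is_word(p, toy_words) for p in parts)

print(result)   # False
print(checked)  # ['example', 'xyz'] -- "deals" is never tested
```

Passing a list comprehension instead of a generator expression would test every part before all even starts.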

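Some substrings may be valid words both as a whole and when split, such as "example" / "ex" + "ample". In some cases we can solve the problem by excluding unwanted words, such as "ex" in the code example above. In others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the words found for each domain name in a set (sets ignore duplicates), and then counting the words once all of them have been found.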
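
To make the double-counting concrete, the sketch below compares a naive count over every split with the set-based approach described here. The splits list is hypothetical sample data of the kind substrings_in_set could yield for one name, and Counter is used only for brevity.

```python
from collections import Counter

# Hypothetical splits for one name, as substrings_in_set might produce them
splits = [["example", "pensions"], ["example", "pens", "ions"]]

# Naive: count every word of every split, so "example" is counted twice
naive = Counter()
for split in splits:
    naive.update(split)

# Set-based: collect the words first, then count each one once per name
found = set()
for split in splits:
    found |= set(split)
deduped = Counter(found)

print(naive["example"], deduped["example"])  # 2 1
```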

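EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses caused by capitalization. Also added a list to keep track of domain names for which no combination of words matched.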

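NECROMANCY EDIT: Changed the substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I improved my own running time from 3.6 seconds to 0.2 seconds!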
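
If very long names are still slow, one further option is to cache the results for each suffix. This is memoization, a different technique from what the answer itself does, and all_splits and memo are hypothetical names; treat it as a sketch rather than a measured improvement.

```python
def all_splits(s, words, memo=None):
    """Return every split of s into dictionary words, caching per suffix."""
    if memo is None:
        memo = {}
    if s in memo:
        return memo[s]
    results = []
    if s in words:
        results.append([s])
    for i in range(1, len(s)):
        head = s[:i]
        if head in words:
            # Each distinct suffix is only ever split once thanks to memo
            for rest in all_splits(s[i:], words, memo):
                results.append([head] + rest)
    memo[s] = results
    return results

toy_words = {"example", "pens", "ions", "pensions"}  # hypothetical dictionary
print(all_splits("examplepensions", toy_words))
# [['example', 'pensions'], ['example', 'pens', 'ions']]
```

Unlike the generator, this builds full lists in memory, so it trades laziness for not re-splitting the same suffix twice.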

score 1 · Accepted Answer

with open('/usr/share/dict/words') as f:
  # A set keeps the membership tests in guess_split fast
  words = set(w.strip() for w in f)

def guess_split(word):
  result = []
  for n in xrange(len(word)):
    if word[:n] in words and word[n:] in words:
      result = [word[:n], word[n:]]
  return result


from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
  for line in f.readlines():
    for word in line.strip().split('.'):
      if len(word) > 3:
        # junks the com , org, stuff
        for x in guess_split(word):
          word_counts[x] += 1

for spam in word_counts.items():
  print '{word}: {count}'.format(word=spam[0],count=spam[1])

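This is a brute-force method that only tries to split a domain into 2 English words. If the domain does not split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever. Fortunately, I guess you will only need 3 or 4 splits at most.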

Output:

deals: 1
example: 2
pensions: 1
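
The extension to more splits that this answer mentions could look roughly like the sketch below. guess_splits and max_parts are hypothetical names, and unlike guess_split above it returns a list of all splits found rather than a single pair.

```python
def guess_splits(word, words, max_parts=4):
    """Return every split of word into 1..max_parts dictionary words."""
    results = []
    if word in words:
        results.append([word])
    if max_parts > 1:
        for n in range(1, len(word)):
            head = word[:n]
            if head in words:
                # Recurse on the remainder with one part fewer allowed
                for rest in guess_splits(word[n:], words, max_parts - 1):
                    results.append([head] + rest)
    return results

toy_words = {"example", "car", "trading", "summer", "offers"}  # hypothetical

print(guess_splits("examplecartrading", toy_words))
# [['example', 'car', 'trading']]
print(guess_splits("examplecartrading", toy_words, max_parts=2))
# []
```

Capping the depth at 3 or 4 parts, as suggested, bounds the recursion even for long names.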

score 1 · Accepted Answer

If you only have a few thousand standard domains, you should be able to do all of this in memory.

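from collections import Counter

def all_sub_strings(s):
    # Assumed helper (not defined in the original): all non-empty substrings of s
    return [s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)]

domainfile = "domains.txt"  # placeholder path to the list of domain names
with open("/usr/share/dict/words") as f:  # word list path is an assumption
    # strip() drops the trailing newline so that lookups can match
    dictionary = set(line.strip().lower() for line in f)

found = []
with open(domainfile) as domains:
    for domain in domains:
        for substring in all_sub_strings(domain.strip().lower()):
            if substring in dictionary:
                found.append(substring)

c = Counter(found)  # this is what you want
print(c)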
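
To read the highest-ranking words off a Counter like c, most_common is probably the most direct route. The found list below is made-up sample data standing in for the real matches.

```python
from collections import Counter

# Hypothetical list of matched words, standing in for `found` above
found = ["example", "deals", "example", "car", "example", "car"]
c = Counter(found)

# most_common(n) returns the n highest counts as (word, count) pairs
top = c.most_common(2)
for word, n in top:
    print("%s: %d" % (word, n))
# example: 3
# car: 2
```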
