python - 大きな文書から電子メールの部分文字列を抽出する

Question

非常に大きな .txt ファイルがあり、数十万の電子メールアドレスが散らばっています。それらはすべて次の形式を取ります。

...<name@domain.com>...

特定の@domain文字列のすべてのインスタンスを探して.txtファイル全体を循環し、<...>内のアドレス全体を取得して追加する最善の方法は何ですか?リスト？私が抱えている問題は、さまざまなアドレスの可変長にあります。

score 11 · Accepted Answer

以下を使用して、テキスト内のすべての電子メールアドレスを検索し、それらを配列に出力するか、各電子メールを別の行に出力することもできます。

import re
line = "why people don't know what regex are? let me know asdfal2@als.com, Users1@gmail.de " \
       "Dariush@dasd-asasdsa.com.lo,Dariush.lastName@someDomain.com"
match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
for i in match:
    print(i)

リストに追加したい場合は、「一致」を印刷するだけです

# this will print the list
    print(match)

score 8 · Accepted Answer

import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
matches = re.findall(rgx, text)
get_first_group = lambda y: list(map(lambda x: x[0], y))
emails = get_first_group(matches)

この悪名高い正規表現を試したことで私を嫌わないでください。正規表現は、以下に示す電子メールアドレスのかなりの部分で機能します。私は主にこれをメールアドレスの有効な文字の基礎として使用しました。

ここで自由に遊んでみてください

正規表現が次のようなメールをキャプチャするバリエーションも作成しましたname at example.com

(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])

score 4 · Accepted Answer

特定のドメインを探している場合:

>>> import re
>>> text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
>>> match = re.findall(r'[\w-\._\+%]+@test\.com',text) # replace test\.com with the domain you're looking for, adding a backslash before periods
>>> match
['la@test.com', 'test@test.com']

score -1 · Accepted Answer

emailregex.comからの正規表現を使用した、この特定の問題に対する別のアプローチを次に示します。

text = "blabla <hello@world.com>><123@123.at> <huhu@fake> bla bla <myname@some-domain.pt>"

# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall('<\S+?>', text)  # ['<hello@world.com>', '<123@123.at>', '<huhu@fake>', '<myname@somedomain.edu>']

# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print emails   # ['hello@world.com', '123@123.at', 'myname@some-domain.pt']

python - 大きな文書から電子メールの部分文字列を抽出する

13 に答える 13

Related

Reference