python - Search for a word (exact match) in multiple texts using Python

Question

I want to let the user select and open multiple texts and search them for exact matches. I want the encoding to be Unicode.

If I search for "cat", I want to find "cat", "cat,", and ".cat", but not "catalog".

I don't know how to let the user search for two words ("cat" OR "dog") in all the texts at the same time. Maybe I can use RE?

So far I let the user enter the path to the directory containing the text files to be searched. Next I want to let the user (raw_input) search for two words in all the texts, then print the results and save them (e.g. "search_word_1" and "search_word_2" found in document1.txt, "search_word_2" found in document4.txt) in a separate document (search_words).

import re, os


path = raw_input("insert path to directory :")
ex_library = os.listdir(path)
search_words = open("sword.txt", "w") # File or maybe list to put in the results
thelist = []

for texts in ex_library:
    f = os.path.join(path, texts)
    text = open(f, "r")
    textname = os.path.basename(texts)
    print textname
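    for line in text:  # iterate over lines; text.read() would yield single characters
        pass  # the search logic is still missing here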
    text.close()

score 1 · Accepted Answer

Regular expressions are the right tool in this case.

I want to find "cat", "cat,", ".cat" but not "catalog".

Pattern: r'\bcat\b'

\b matches at a word boundary.
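
As a quick check of the word-boundary behaviour, here is a minimal sketch. It assumes Python 3 (the code elsewhere on this page is Python 2), the sample strings are the ones from the question, and the helper name `matches_whole_word` is made up for illustration.

```python
import re

# Assumes Python 3. \b asserts a position between a word character
# and a non-word character (or the start/end of the string).
CAT = re.compile(r"\bcat\b")

def matches_whole_word(text):
    """Return True if 'cat' occurs in text as a whole word (hypothetical helper)."""
    return CAT.search(text) is not None

for sample in ["cat", "cat,", ".cat", "catalog"]:
    print(sample, matches_whole_word(sample))
# cat True
# cat, True
# .cat True
# catalog False
```

"catalog" is rejected because the "t" is followed by another word character, so there is no boundary after "cat".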

How to let the user search for two words ("cat" OR "dog") in all the texts at the same time

Pattern: r'\bcat\b|\bdog\b'
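
The alternation does not have to be typed by hand. The sketch below, assuming Python 3, builds it from a list of words the same way the answer's code does; `build_pattern` and the sample sentence are illustrative names and data, not part of the original answer.

```python
import re

def build_pattern(words):
    """Join the words into one alternation, each wrapped in word boundaries."""
    # re.escape keeps characters such as '.' or '+' in a search word literal.
    return re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words))

pattern = build_pattern(["cat", "dog"])
found = set(pattern.findall("The dog chased a cat, not a catalog or a dogma."))
print(sorted(found))  # ['cat', 'dog']
```

One caveat: because `\b` needs a word character on one side, a search word that starts or ends with punctuation (for example "c++") will probably not match the way you expect with this pattern.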

To print "filename: <words that are found in it>":

#!/usr/bin/env python
import os
import re
import sys

def fgrep(words, filenames, encoding='utf-8', case_insensitive=False):
    findwords = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                           flags=re.I if case_insensitive else 0).findall
    for name in filenames:
        with open(name, 'rb') as file:
            text = file.read().decode(encoding)
            found_words = set(findwords(text))
            yield name, found_words

def main():
    words = [w.decode(sys.stdin.encoding) for w in sys.argv[1].split(",")]
    filenames = sys.argv[2:] # the rest is filenames
    for filename, found_words in fgrep(words, filenames):
        print "%s: %s" % (os.path.basename(filename), ",".join(found_words))

main()

Example:

$ python findwords.py 'cat,dog' /path/to/*.txt

Alternative solution

To avoid reading the whole file into memory:

import codecs

...
with codecs.open(name, encoding=encoding) as file:
    found_words = set(w for line in file for w in findwords(line))
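
For readers on Python 3, the same line-by-line idea might look like the sketch below. It is an adaptation, not the answer's code: it assumes the built-in `open()` with an `encoding` argument in place of `codecs.open`, and the function name `find_words_in_file` is hypothetical.

```python
import re

def find_words_in_file(name, words, encoding="utf-8"):
    """Return the set of search words found in the file, reading one line at a time."""
    regex = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words))
    found = set()
    # Assumes Python 3: open() decodes the file for us, so no codecs module is needed.
    with open(name, encoding=encoding) as handle:
        for line in handle:  # only one line is held in memory at a time
            found.update(regex.findall(line))
    return found
```

Since the search words here are single words, matching each line separately should find the same words as matching the whole text at once.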

You could also print the found words in the context where they were found, e.g. print the lines with the matching words highlighted:

from colorama import init  # pip install colorama
init(strip=not sys.stdout.isatty())  # strip colors if stdout is redirected
from termcolor import colored  # pip install termcolor

highlight = lambda s: colored(s, on_color='on_red', attrs=['bold', 'reverse'])

...
regex = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                   flags=re.I if case_insensitive else 0)

for line in file:
    if regex.search(line): # line contains words
        line = regex.sub(lambda m: highlight(m.group()), line)
        yield line

score 0 · Accepted Answer

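You need to split the text of each file on whitespace and punctuation. Once that is done, you can simply look for the search words in the resulting list. You should also convert everything to lowercase, unless you want the search to be case-sensitive.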
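
A minimal sketch of this split-and-lookup approach, assuming Python 3 and assuming that "whitespace and punctuation" can be approximated by the regex `\W+` (runs of non-word characters). The function name `words_present` and the sample text are illustrative only.

```python
import re

def words_present(text, search_words):
    """Return the search words that occur in text as whole tokens, ignoring case."""
    # Split on runs of non-word characters (whitespace and punctuation),
    # drop empty strings, and lowercase every token.
    tokens = set(t.lower() for t in re.split(r"\W+", text) if t)
    return [w for w in search_words if w.lower() in tokens]

print(words_present("A cat, a Dog. No catalog!", ["cat", "dog", "bird"]))
# ['cat', 'dog']
```

Putting the tokens in a set makes each lookup cheap, which helps when the same text is checked against many search words.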
