python - Suggestion needed - Improving the performance of Python code

Question

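I need advice on improving the performance of my code.

I have two files (Keyword.txt, description.txt). Keyword.txt consists of a list of keywords (more than 11,000, to be specific), and descriptions.txt consists of very large text descriptions (more than 9,000).

I am trying to read the keywords from keyword.txt one at a time and check whether each keyword exists in a description. If it does, I write it to a new file. So this is like a many-to-many relationship (11,000 * 9,000).

Sample keywords: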

Xerox
VMWARE CLOUD

Sample description (it is huge):

Planning and implementing entire IT Infrastructure. Cyberoam firewall implementation and administration in head office and branch office. Report generation and analysis. Including band width conception, internet traffic and application performance. Windows 2003/2008 Server Domain controller implementation and managing. VERITAS Backup for Clients backup, Daily backup of applications and database. Verify the backed up database for data integrity. Send backup tapes to remote location for safe storage Installing and configuring various network devices; Routers, Modems, Access Points, Wireless ADSL+ modems / Routers Monitoring, managing & optimizing Network. Maintaining Network Infrastructure for various clients. Creating Users and maintaining the Linux Proxy servers for clients. Trouble shooting, diagnosing, isolating & resolving Windows / Network Problems. Configuring CCTV camera, Biometrics attendance machine, Access Control System Kaspersky Internet Security / ESET NOD32

Below is the code I wrote:

import csv
import nltk
import re
wr = open(OUTPUTFILENAME,'w')
def match():
    c = 0
    ft = open('DESCRIPTION.TXT','r')
    ky2 = open('KEYWORD.TXT','r')
    reader = csv.reader(ft)
    keywords = []
    keyword_reader2 = csv.reader(ky2)
    for x in keyword_reader2: # Storing all the keywords to a list
        keywords.append(x[1].lower())

    string = ' '
    c = 0
    for row in reader:
        sentence = row[1].lower()
        id = row[0]
        for word in keywords:
            if re.search(r'\b{}\b'.format(re.escape(word.lower())),sentence):
                string = string + id+'$'+word.lower()+'$'+sentence+ '\n'
                c = c + 1
        if c > 5000:  # I am writing 5000 lines at a time.
            print("Batch printed")
            c = 0
            wr.write(string)
            string = ' '
    wr.write(string)
    ky2.close()
    ft.close()
    wr.close()

match()

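Currently this code takes about 120 minutes to complete. I tried a few ways to improve the speed.

At first I was writing one line at a time, then I changed it to 5000 lines at a time, since it is a small file and I can afford to keep everything in memory. It did not show much improvement.
I pushed everything to stdout and used a pipe from the console to append everything to a file. This was even slower.

I want to know whether there is a better way of doing this, as I may have done something wrong in the code.

My PC specs: RAM: 15 GB, Processor: i7 4th gen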

score 2 · Accepted Answer

If all your search phrases consist of whole words (they begin and end on a word boundary), then parallel indexing into a word tree is about as efficient as it gets.

Something like

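import string

# keep lowercase characters and digits
# keep apostrophes for contractions (isn't, couldn't, etc)
# convert uppercase characters to lowercase
# replace all other printable symbols with spaces
# (built from the string constants so both arguments always have equal
#  length; str.maketrans raises ValueError otherwise)
_TO_SPACE = string.punctuation.replace("'", "") + string.whitespace
TO_ALPHANUM_LOWER = str.maketrans(
    string.ascii_uppercase + _TO_SPACE,
    string.ascii_lowercase + " " * len(_TO_SPACE)
)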

def clean(s):
    """
    Convert string `s` to canonical form for searching
    """
    return s.translate(TO_ALPHANUM_LOWER)

class WordTree:
    __slots__ = ["children", "terminal"]

    def __init__(self, phrases=None):
        self.children = {}   # {"word": WordTree}
        self.terminal = ''   # if end of search phrase, full phrase is stored here
        # preload tree
        if phrases:
            for phrase in phrases:
                self.add_phrase(phrase)

    def add_phrase(self, phrase):
        tree  = self
        words = clean(phrase).split()
        for word in words:
            ch = tree.children
            if word in ch:
                tree = ch[word]
            else:
                tree = ch[word] = WordTree()
        tree.terminal = " ".join(words)

    def inc_search(self, word):
        """
        Search one level deeper into the tree

        Returns
          (None,    ''    )  if word not found
          (subtree, ''    )  if word found but not terminal
          (subtree, phrase)  if word found and completes a search phrase
        """
        ch = self.children
        if word in ch:
            wt = ch[word]
            return wt, wt.terminal
        else:
            return (None, '')

    def parallel_search(self, text):
        """
        Return all search phrases found in text
        """
        found  = []
        fd = found.append
        partials = []
        for word in clean(text).split():
            new_partials = []
            np = new_partials.append
            # new search from root
            wt, phrase = self.inc_search(word)
            if wt:     np(wt)
            if phrase: fd(phrase)
            # continue existing partial matches
            for partial in partials:
                wt, phrase = partial.inc_search(word)
                if wt:     np(wt)
                if phrase: fd(phrase)
            partials = new_partials
        return found

    def tree_repr(self, depth=0, indent="  ", terminal=" *"):
        for word,tree in self.children.items():
            yield indent * depth + word + (terminal if tree.terminal else '')
            yield from tree.tree_repr(depth + 1, indent, terminal)

    def __repr__(self):
        return "\n".join(self.tree_repr())
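
To see what the parallel search does on a small input, here is a minimal, self-contained sketch of the same idea using nested dicts instead of the WordTree class. The names `words_of`, `build_tree`, `find_phrases` and the `END` marker are made up for this illustration, and `words_of` is only a rough stand-in for `clean(...).split()` (it drops non-ASCII characters, which the translation table above would keep).

```python
import re

END = "$end"  # hypothetical marker key; "$" never appears in a cleaned word

def words_of(text):
    # Rough stand-in for clean(): lowercase, keep letters, digits, apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def build_tree(phrases):
    root = {}
    for phrase in phrases:
        node = root
        words = words_of(phrase)
        for word in words:
            node = node.setdefault(word, {})   # descend, creating nodes as needed
        if words:
            node[END] = " ".join(words)        # mark the end of a search phrase
    return root

def find_phrases(root, text):
    found = []
    partials = []   # subtrees of phrases that are partly matched so far
    for word in words_of(text):
        next_partials = []
        # try to start a new match at the root, and extend every partial match
        for node in [root] + partials:
            child = node.get(word)
            if child is not None:
                next_partials.append(child)
                if END in child:
                    found.append(child[END])
        partials = next_partials
    return found

tree = build_tree(["Xerox", "VMWARE CLOUD", "vmware"])
print(find_phrases(tree, "Migrated Xerox print servers to VMware Cloud."))
# -> ['xerox', 'vmware', 'vmware cloud']
```

Both "vmware" and "vmware cloud" are reported, because a phrase that is a prefix of a longer phrase is still a complete search phrase in its own right.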

Then your program becomes

import csv

SEARCH_PHRASES = "keywords.csv"
SEARCH_INTO    = "descriptions.csv"
RESULTS        = "results.txt"

# get search phrases, build WordTree
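with open(SEARCH_PHRASES, newline="") as inf:
    # pass a single iterable; unpacking with * would hand each phrase
    # to WordTree() as a separate positional argument and raise TypeError
    wt = WordTree(phrase for _,phrase in csv.reader(inf))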

with open(SEARCH_INTO) as inf, open(RESULTS, "w") as outf:
    # bound methods (save some look-ups)
    find_phrases = wt.parallel_search
    fmt          = "{}${}${}\n".format
    write        = outf.write
    # sentences to search
    for id,sentence in csv.reader(inf):
        # search phrases found
        for found in find_phrases(sentence):
            # store each result
            write(fmt(id, found, sentence))

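This should be something like a thousand times faster, since each description is scanned once with a dictionary lookup per word, instead of running more than 11,000 regex searches per description.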
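
One difference from the loop in the question may matter: `parallel_search` appends a phrase every time it occurs, so a keyword that appears twice in a description produces two output lines, whereas the regex loop wrote one line per keyword. If one line per keyword is wanted, the results could be de-duplicated before writing. A small sketch, assuming Python 3.7+ (where dicts preserve insertion order); the helper name `unique_in_order` and the sample list are made up for illustration:

```python
def unique_in_order(items):
    """Drop repeats while keeping first-seen order."""
    return list(dict.fromkeys(items))

# Hypothetical result of one parallel_search call on a description
# that mentions "xerox" twice.
found = ["xerox", "vmware", "xerox", "vmware cloud"]
for phrase in unique_in_order(found):
    print(phrase)
```

In the program above this would amount to iterating over `unique_in_order(find_phrases(sentence))` instead of `find_phrases(sentence)`.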

score 2 · Accepted Answer

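I presume you want to make your searches faster. In that case, if you don't care about the frequency of the keywords in the description, only that they exist, you could try the following:

For each description file, split the text into individual words and generate a set of unique words.

Then, for each keyword in your list of keywords, check whether the set contains the keyword, and write to the file if it does.

That should make your iterations faster. It should also help you skip the regexes, which are likely part of your performance problem as well.

PS: My approach assumes that you filter out punctuation.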
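
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and sample data are invented, all ASCII punctuation is replaced by spaces, and matching is case-insensitive as in the question.

```python
import string

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def unique_words(text):
    # Lowercase, strip punctuation, split on whitespace, keep unique words
    return set(text.lower().translate(PUNCT_TO_SPACE).split())

def keywords_present(keywords, description):
    words = unique_words(description)          # built once per description
    return [kw for kw in keywords if kw.lower() in words]

keywords = ["Xerox", "Linux", "VMWARE CLOUD"]
description = "Maintaining the Linux proxy servers; Xerox printers on VMware Cloud."
print(keywords_present(keywords, description))
# -> ['Xerox', 'Linux']
```

Note that "VMWARE CLOUD" is not reported even though the text contains it: a set of single words can only match single-word keywords, so multi-word keywords would need separate handling.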
