python - How do I count phrases in Python and use those phrases as a header?

Question

I have a file in which I am trying to count phrases. There are about 100 phrases that I need to count in a given line of text. As a simple example, I have the following:

phrases = """hello
name
john doe
"""

text1 = 'id=1: hello my name is john doe.  hello hello.  how are you?'
text2 = 'id=2: I am good.  My name is Jane.  Nice to meet you John Doe'

header = ''
for phrase in phrases.splitlines():
    header = header+'|'+phrase
header = 'id'+header

I would like to produce output like this:

id|hello|name|john doe
1|3|1|1
2|0|1|1

I have the header working. I just don't know how to count each phrase and append the output.

score 3 · Accepted Answer

Build a list of headers:

In [6]: p=phrases.strip().split('\n')

In [7]: p
Out[7]: ['hello', 'name', 'john doe']

Use a regular expression with word boundaries, i.e. \b, to get the number of occurrences while avoiding partial matches. The re.I flag makes the search case-insensitive.

In [11]: import re

In [14]: re.findall(r'\b%s\b' % p[0], text1)
Out[14]: ['hello', 'hello', 'hello']

In [15]: re.findall(r'\b%s\b' % p[0], text1, re.I)
Out[15]: ['hello', 'hello', 'hello']

In [16]: re.findall(r'\b%s\b' % p[1], text1, re.I)
Out[16]: ['name']

In [17]: re.findall(r'\b%s\b' % p[2], text1, re.I)
Out[17]: ['john doe']

Wrap the call in len() to get the number of matches found.
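Putting those steps together, a minimal sketch in Python 3 syntax might look like the following. The helper name count_phrase is hypothetical, and the use of re.escape is an added assumption, not part of the answer above: it guards against phrases that contain regex metacharacters such as "." or "+".

```python
import re

phrases = ['hello', 'name', 'john doe']
texts = [
    'id=1: hello my name is john doe.  hello hello.  how are you?',
    'id=2: I am good.  My name is Jane.  Nice to meet you John Doe',
]

def count_phrase(phrase, text):
    # Hypothetical helper: whole-word, case-insensitive occurrence count.
    # re.escape (an assumption added here) treats the phrase literally.
    pattern = r'\b%s\b' % re.escape(phrase)
    return len(re.findall(pattern, text, re.I))

# One list of counts per text, in the same order as the phrases
rows = [[count_phrase(p, text) for p in phrases] for text in texts]
print(rows)
```

Each inner list lines up with the phrase list, so it can be joined with "|" to form one output row.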

score 2 · Accepted Answer

You can count occurrences of a word in a string using .count():

>>> text1.lower().count('hello')
3

So this should work (apart from the discrepancies noted in the comments below):

phrases = """hello
name
john doe
"""

text1 = 'id=1: hello my name is john doe.  hello hello.  how are you?'
text2 = 'id=2: I am good.  My name is Jane.  Nice to meet you John Doe'

texts = [text1,text2]

header = ''
for phrase in phrases.splitlines():
    header = header+'|'+phrase
header = 'id'+header
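print(header)

# start=1 so the printed ids are 1, 2, ... (assumes texts are in id order)
for id, text in enumerate(texts, 1):
    textcount = [id]
    for phrase in header.split('|')[1:]:
        # str.count matches substrings, so 'name' also counts inside 'names'
        textcount.append(text.lower().count(phrase))
    print("|".join(map(str, textcount)))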

The above assumes the list of texts is in order of id, but if they all start with 'id=n' you could do something like this:

for text in texts:
    id = text[3]  # assumes id is 4th char
    textcount = [id]
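Indexing a fixed character position only works for single-digit ids. As a hedged alternative, the id could be read with a regular expression instead. The helper name extract_id below is hypothetical, and the sketch assumes every text begins with 'id=' followed by digits and a colon.

```python
import re

def extract_id(text):
    # Hypothetical helper: assumes the text starts with 'id=<digits>:'
    match = re.match(r'id=(\d+):', text)
    if match is None:
        raise ValueError('text does not start with id=<n>: %r' % text)
    return match.group(1)

print(extract_id('id=1: hello my name is john doe.'))
print(extract_id('id=12: a text with a two-digit id'))
```

The id comes back as a string, which is convenient here because the row is joined into a "|"-separated line anyway.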

score 0 · Accepted Answer

This doesn't answer your question (@askewchan and @Fredrik have done that), but I thought I'd offer some advice on the rest of your approach:

You may be better served by defining your phrases in a list:

phrases = ['hello', 'name', 'john doe']

This lets you skip the loop when building the header:

header = 'id|' + '|'.join(phrases)

It also lets you drop the .split('|')[1:] part of askewchan's answer, for example, in favor of for phrase in phrases:
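As a rough illustration of that advice, here is a sketch in Python 3 syntax that keeps the phrases in a list, builds the header with join, and loops over the list directly. It reuses the .count() approach from the earlier answer, so it inherits the same substring-matching behavior. The variable name lines is an assumption made for this example.

```python
phrases = ['hello', 'name', 'john doe']
texts = [
    'id=1: hello my name is john doe.  hello hello.  how are you?',
    'id=2: I am good.  My name is Jane.  Nice to meet you John Doe',
]

# The header comes straight from the list, with no loop needed
lines = ['id|' + '|'.join(phrases)]

# Assumes the texts are listed in id order, starting at 1
for n, text in enumerate(texts, 1):
    counts = [text.lower().count(phrase) for phrase in phrases]
    lines.append('|'.join(map(str, [n] + counts)))

print('\n'.join(lines))
```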

score 0 · Accepted Answer

phrases = """hello
name
john doe
"""

text1 = 'id=1: hello my name is john doe.  hello hello.  how are you?'
text2 = 'id=2: I am good.  My name is Jane.  Nice to meet you John Doe'

import re
import collections

txts = [text1, text2]
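# Note: split() breaks on any whitespace, so 'john doe' becomes two entries
phrase_list = phrases.split()
print("id|%s" % "|".join(phrase_list))
for txt in txts:
    # Separate the numeric id from the rest of the line
    (tid, rest) = re.match(r"id=(\d+):\s*(.*)", txt).groups()

    # Case-sensitive word counts; words absent from the text count as 0
    counter = collections.Counter(re.findall(r"\w+", rest))
    print("%s|%s" % (tid, "|".join([str(counter.get(p, 0)) for p in phrase_list])))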

This gives:

id|hello|name|john|doe
1|3|1|1|1
2|0|1|0|0
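Note that the output above treats "john" and "doe" as separate columns and is case-sensitive, because a word-level Counter only sees single tokens. One possible way to keep the Counter idea while supporting multi-word phrases is to count runs of consecutive words. The sketch below is an assumption-laden variant, not the answer's method: the helper name ngram_counts is hypothetical, phrases are assumed to be lowercase with single spaces, and runs of words are allowed to span punctuation.

```python
import re
from collections import Counter

phrases = ['hello', 'name', 'john doe']
texts = [
    'id=1: hello my name is john doe.  hello hello.  how are you?',
    'id=2: I am good.  My name is Jane.  Nice to meet you John Doe',
]

def ngram_counts(words, n):
    # Hypothetical helper: count every run of n consecutive words
    return Counter(' '.join(words[i:i + n]) for i in range(len(words) - n + 1))

table = []
for text in texts:
    tid, rest = re.match(r'id=(\d+):\s*(.*)', text).groups()
    words = re.findall(r'\w+', rest.lower())  # lowercase for case-insensitive counts
    row = [tid]
    for phrase in phrases:
        n = len(phrase.split())  # number of words in the phrase
        row.append(ngram_counts(words, n)[phrase])  # missing keys count as 0
    table.append(row)

print('id|' + '|'.join(phrases))
for row in table:
    print('|'.join(map(str, row)))
```

With around 100 phrases it may be worth building one Counter per phrase length for each text instead of one per phrase, but the simpler form is kept here for readability.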
