python - Text content analyzer in Python

Translated from: https://stackoverflow.com/questions/33976069 2015-11-28T20:24:05.847

Viewed 1367 times

I wrote a text content analyzer in Python that reads its input from a file and prints the following:

Total word count
Unique word count
Number of sentences

Here is the code:

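import re
import string
import sys

def function(s):
    # Lowercase the word and strip all ASCII punctuation from it.
    return re.sub("[%s]" % re.escape(string.punctuation), '', s.lower())

def main():
    words_list = []

    # Collect the whitespace-separated tokens of the file named on the command line.
    with open(sys.argv[1], "r") as f:
        for line in f:
            words_list.extend(line.split())

    print("Total word count:", len(words_list))

    # Normalize every token so that "Word," and "word" count as the same word.
    new_words = map(function, words_list)

    print("Unique words:", len(set(new_words)))

    # Count one sentence for every token that contains ., ! or ?
    # (optionally followed by closing quotes).
    nb_sentence = 0
    for word in words_list:
        if re.search(r"[.!?]['\"]*", word):
            nb_sentence += 1

    print("Sentences:", nb_sentence)

if __name__ == "__main__":
    main()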

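I am now trying to compute the average sentence length in words, find frequently used phrases (phrases of three or more words that occur three or more times), and build a list of words sorted from most to least frequently used. Can anyone help?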
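
One possible way to approach these three goals is sketched below. It is a minimal sketch under stated assumptions, not a definitive implementation: it assumes Python 3, treats a run of `.`, `!` or `?` followed by whitespace or the end of the text as a sentence boundary, and only counts phrases that stay inside a single sentence. The helper names (`tokenize`, `split_sentences`, `average_sentence_length`, `frequent_phrases`, `words_by_frequency`) are hypothetical and do not appear in the code above.

```python
import re
import string
from collections import Counter

# Same punctuation-stripping idea as the helper in the question.
PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))
# Assumption: a sentence ends at ., ! or ? (plus optional closing quotes)
# followed by whitespace or the end of the text.
SENTENCE_END_RE = re.compile(r"[.!?]+['\"]*(?:\s+|$)")


def tokenize(text):
    """Split on whitespace, lowercase, strip punctuation, drop empty tokens."""
    words = (PUNCT_RE.sub("", w.lower()) for w in text.split())
    return [w for w in words if w]


def split_sentences(text):
    """Return the non-empty chunks between sentence boundaries."""
    return [part for part in SENTENCE_END_RE.split(text) if part.strip()]


def average_sentence_length(text):
    """Mean number of words per sentence (0.0 for empty input)."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]
    if not sentences:
        return 0.0
    return sum(len(s) for s in sentences) / len(sentences)


def frequent_phrases(text, n=3, min_count=3):
    """Phrases of exactly n words that occur at least min_count times."""
    counts = Counter()
    for sentence in split_sentences(text):
        words = tokenize(sentence)
        # Slide a window of n words across the sentence.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return [(p, c) for p, c in counts.most_common() if c >= min_count]


def words_by_frequency(text):
    """(word, count) pairs sorted from most to least frequent."""
    return Counter(tokenize(text)).most_common()
```

To use it with the script above, the whole file could be read into one string (for example with `f.read()`) and passed to each function. `frequent_phrases` handles one phrase length per call, so covering "three or more words" would mean calling it with `n=3`, `n=4`, and so on, until no phrase reaches the minimum count.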
