python - Vocabulary size of a CSV file

Question

I have a CSV file that looks like this:

Lorem ipsum dolor sit amet , 12:01
consectetuer adipiscing elit, sed , 12:02

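etc...

It is a fairly large file (about 10,000 lines). I would like to get the total vocabulary size across all lines of text. That is, ignore the second column (the time), lowercase everything, and then count the number of distinct words.

Problems: 1) how to separate the individual words within each line, and 2) how to lowercase everything and remove non-alphabetic characters.

So far I have the following code: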

import csv
with open('/Users/file.csv', 'rb') as file:
    vocabulary = []
    i = 0
    reader = csv.reader(file, delimiter=',')
    for row in reader:
        for word in row:
            if row in vocabulary:
                break
            else:
                vocabulary.append(word)
                i = i +1
print i

Thanks for your help!

score 3 · Accepted Answer

Python's csv module is a great library, but using it for simple tasks can be overkill. To me, this particular case is a classic example where using the csv module may overcomplicate things.

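To me,

just iterating through the file,
splitting each line on its last comma and taking the first part,
then splitting that part on whitespace,
converting each word to lowercase,
stripping punctuation and digits from the ends of each word,
and collecting the result in a set comprehension

is a simple, linear approach.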

An example run with the following file content:

Lorem Ipsum is simply dummy "text" of the ,0
printing and typesetting; industry. Lorem,1
 Ipsum has been the industry's standard ,2
dummy text ever since the 1500s, when an,3
 unknown printer took a galley of type and,4
 scrambled it to make a type specimen ,5
book. It has survived not only five ,6
centuries, but also the leap into electronic,7
typesetting, remaining essentially unch,8
anged. It was popularised in the 1960s with ,9
the release of Letraset sheets conta,10
ining Lorem Ipsum passages, and more rec,11
ently with desktop publishing software like,12
 !!Aldus PageMaker!! including versions of,13
Lorem Ipsum.,14

>>> from string import digits, punctuation
>>> remove_set = digits + punctuation
>>> with open("test.csv") as fin:
...     words = {word.lower().strip(remove_set) for line in fin
...              for word in line.rsplit(",", 1)[0].split()}
...


>>> words
set(['and', 'pagemaker', 'passages', 'sheets', 'galley', 'text', 'is', 'in', 'it', 'anged', 'an', 'simply', 'type', 'electronic', 'was', 'publishing', 'also', 'unknown', 'make', 'since', 'when', 'scrambled', 'been', 'desktop', 'to', 'only', 'book', 'typesetting', 'rec', "industry's", 'has', 'ever', 'into', 'more', 'printer', 'centuries', 'dummy', 'with', 'specimen', 'took', 'but', 'standard', 'five', 'survived', 'leap', 'not', 'lorem', 'a', 'ipsum', 'essentially', 'unch', 'conta', 'like', 'ining', 'versions', 'of', 'industry', 'ently', 'remaining', 's', 'printing', 'letraset', 'popularised', 'release', 'including', 'the', 'aldus', 'software'])
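Note that str.strip only removes characters from the two ends of each word, which is why "industry's" keeps its apostrophe and "1500s" is reduced to "s" in the output above. If every non-alphabetic character should be dropped, as the question asks, one possible variation is sketched below. It is only a sketch for Python 3: the function name vocabulary, the NON_LETTERS pattern, and the in-memory sample lines are made up for illustration, and it assumes plain ASCII text.

```python
import re

# Anything that is not a lowercase ASCII letter (applied after lower()).
NON_LETTERS = re.compile(r"[^a-z]+")

def vocabulary(lines):
    """Return the set of distinct lowercase words in the text column."""
    words = set()
    for line in lines:
        text = line.rsplit(",", 1)[0]       # drop the last column (the time)
        for token in text.lower().split():  # split on any whitespace
            word = NON_LETTERS.sub("", token)
            if word:                        # skip tokens that had no letters
                words.add(word)
    return words

# Hypothetical sample data, taken from the lines shown in the question.
sample = [
    "Lorem ipsum dolor sit amet , 12:01\n",
    "consectetuer adipiscing elit, sed , 12:02\n",
]
vocab = vocabulary(sample)
print(len(vocab))
```

Since the function only iterates over lines, an open file object can be passed in place of the list.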

score 1 · Accepted Answer

You almost have what you need. One missing point is the lowercase conversion, which is easily done with word.lower().

The other thing you are missing is splitting each line into words. For this you should use .split(), which by default splits on all whitespace characters, i.e. spaces, tabs, etc.
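As a small illustration of these two calls (the sample string here is made up), runs of spaces, tabs, and the trailing newline are all treated as separators:

```python
# Hypothetical input line with mixed whitespace.
line = "Lorem  Ipsum\tis simply\n"

# split() with no arguments splits on any run of whitespace
# and discards empty strings; lower() normalizes the case.
words = [w.lower() for w in line.split()]
print(words)
```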

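One problem you will run into is telling the commas in the text apart from the comma that separates the columns. Perhaps skip the csv reader altogether and simply read each line, remove the time, and then split what is left into words.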

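import re

# Text mode ('r') so each line is a string the patterns can be applied to
with open('/Users/file.csv', 'r') as file:
    for line in file:
        # Remove the time column, e.g. " , 12:01"
        line = re.sub(" , [0-2][0-9]:[0-5][0-9]", "", line)
        # Remove punctuation; inside [...] no "|" separators are needed
        line = re.sub("[,!.?\"]", "", line)
        words = [w.lower() for w in line.split()]
        for word in words:
            ...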

If you want to remove other characters, include them in the second regular expression. If performance matters, compile the two regular expressions once before the for loop.
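A minimal sketch of the precompiled variant, written for Python 3 and collecting the words in a set so that duplicates are counted once. The names TIME_COLUMN, PUNCTUATION, and count_vocabulary are hypothetical, the sample lines are made up, and the time pattern is an assumption: it is slightly more permissive about whitespace than the one above and is anchored to the end of the line.

```python
import re

# Compiled once, outside the loop.
TIME_COLUMN = re.compile(r"\s*,\s*[0-2][0-9]:[0-5][0-9]\s*$")
PUNCTUATION = re.compile(r'[,!.?"]')

def count_vocabulary(lines):
    """Count the distinct lowercase words in the text column."""
    vocabulary = set()
    for line in lines:
        line = TIME_COLUMN.sub("", line)   # drop the trailing time column
        line = PUNCTUATION.sub("", line)   # drop the listed punctuation
        vocabulary.update(w.lower() for w in line.split())
    return len(vocabulary)

# Hypothetical sample data in the format shown in the question.
sample = [
    "Lorem ipsum dolor sit amet , 12:01\n",
    "consectetuer adipiscing elit, sed , 12:02\n",
    "Lorem IPSUM! , 12:03\n",
]
print(count_vocabulary(sample))
```

A set is used instead of a list because membership checks and de-duplication are then handled automatically, which matters more as the file grows.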
