python - "list index out of range" in Python

Question

I have Python code that indexes a text file containing Arabic. When I tested the code on English text it worked fine, but when I tested it on Arabic text it raised an error. Note: the text file is saved with Unicode encoding, not ANSI encoding.

This is my code:

from whoosh import fields, index
import os.path
import csv
import codecs
from whoosh.qparser import QueryParser

# This list associates a name with each position in a row
columns = ["juza","chapter","verse","voc"]

schema = fields.Schema(juza=fields.NUMERIC,
                       chapter=fields.NUMERIC,
                       verse=fields.NUMERIC,
                       voc=fields.TEXT)

# Create the Whoosh index
indexname = "indexdir"
if not os.path.exists(indexname):
  os.mkdir(indexname)
ix = index.create_in(indexname, schema)

# Open a writer for the index
with ix.writer() as writer:
  with open("h.txt", 'r') as txtfile:
    lines=txtfile.readlines()

    # Read each row in the file
    for i in lines:

      # Create a dictionary to hold the document values for this row
      doc = {}
      thisline=i.split()
      u=0

      # Read the values for the row enumerated like
      # (0, "juza"), (1, "chapter"), etc.
      for w in thisline: 
        # Get the field name from the "columns" list
          fieldname = columns[u]
          u+=1
          #if isinstance(w, basestring):
          #     w = unicode(w)
          doc[fieldname] = w
      # Pass the dictionary to the add_document method
      writer.add_document(**doc)
with ix.searcher() as searcher:
    query = QueryParser("voc", ix.schema).parse(u"بسم")
    results = searcher.search(query)
    print(len(results))
    print(results[1])

The error is:

Traceback (most recent call last):
  File "C:\Python27\yarab.py", line 38, in <module>
    fieldname = columns[u]
IndexError: list index out of range

This is a sample of the file:

1   1   1   كتاب
1   1   2   قرأ
1   1   3   لعب
1   1   4   كتاب

score 0 · Accepted Answer

Your script is missing the Unicode header. The first line should be:

# encoding: utf-8

Also, to open the file with a Unicode encoding, use:

import codecs
with codecs.open("s.txt", encoding='utf-8') as txtfile:
    lines = txtfile.readlines()  # each line is decoded to a unicode string
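A minimal sketch of the same idea using `io.open`, which behaves the same way on Python 2 and Python 3. The helper name `read_lines` is hypothetical. The right `encoding` value depends on how the file was actually saved: the question only says "Unicode", and if that means the Windows Notepad "Unicode" option, the file may be UTF-16 rather than UTF-8. That is an assumption about the asker's file, not something the question confirms.

```python
import io


def read_lines(path, encoding="utf-8"):
    """Return the file's lines as decoded text, without line endings.

    `read_lines` is a hypothetical helper; `encoding` must match the
    encoding the file was really saved with (an assumption to verify).
    """
    with io.open(path, "r", encoding=encoding) as txtfile:
        return [line.rstrip("\r\n") for line in txtfile]
```

If UTF-8 decoding fails or produces unexpected characters, `read_lines("h.txt", encoding="utf-16")` may be worth trying.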

score 0 · Accepted Answer

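I don't see anything obviously wrong, but make sure you are designing for errors. Catch the case where split() returns more elements than you expect and handle it right away (e.g. print the line and exit). It looks like you may be dealing with badly formatted data.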
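That check could be sketched roughly as follows. The names `parse_row` and `COLUMNS` are hypothetical. Keeping any extra words together in the last field is an assumption about what the asker wants, not something stated in the question.

```python
COLUMNS = ["juza", "chapter", "verse", "voc"]


def parse_row(line, columns=COLUMNS):
    """Map one whitespace-separated line onto the column names.

    Returns None for a blank line and raises ValueError when the line
    has fewer fields than there are columns.
    """
    # Split at most len(columns) - 1 times, so extra whitespace-separated
    # words stay in the last field instead of running past the list.
    parts = line.strip().split(None, len(columns) - 1)
    if not parts:
        return None  # blank line: nothing to index
    if len(parts) != len(columns):
        raise ValueError("expected %d fields, got %d: %r"
                         % (len(columns), len(parts), line))
    return dict(zip(columns, parts))
```

In the indexing loop, the inner `for w in thisline` block would then become a call to `parse_row(i)`. Rows that return None are skipped, and the line from any ValueError is printed so the malformed data can be inspected.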
