python - Split or strip a variable number of characters from a line of text in Python?

Question

I have a large amount of data of this type:

  array(14) {
    ["ap_id"]=>
    string(5) "22755"
    ["user_id"]=>
    string(4) "8872"
    ["exam_type"]=>
    string(32) "PV Technical Sales Certification"
    ["cert_no"]=>
    string(12) "PVTS081112-2"
    ["explevel"]=>
    string(1) "0"
    ["public_state"]=>
    string(2) "NY"
    ["public_zip"]=>
    string(5) "11790"
    ["email"]=>
    string(19) "ivorabey@zeroeh.com"
    ["full_name"]=>
    string(15) "Ivor Abeysekera"
    ["org_name"]=>
    string(21) "Zero Energy Homes LLC"
    ["org_website"]=>
    string(14) "www.zeroeh.com"
    ["city"]=>
    string(11) "Stony Brook"
    ["state"]=>
    string(2) "NY"
    ["zip"]=>
    string(5) "11790"
  }

I wrote a for loop in Python that reads the file, creates a dictionary for each array, and stores the elements like this:

a = 0
data = [{}]

with open( "mess.txt" ) as messy:
        lines = messy.readlines()
        for i in range( 1, len(lines) ):
            line = lines[i]
            if "public_state" in line:
                data[a]['state'] = lines[i + 1]
            elif "public_zip" in line:
                data[a]['zip'] = lines[i + 1]
            elif "email" in line:
                data[a]['email'] = lines[i + 1]
            elif "full_name" in line:
                data[a]['contact'] = lines[i + 1]
            elif "org_name" in line:
                data[a]['name'] = lines[i + 1]
            elif "org_website" in line:
                data[a]['website'] = lines[i + 1]
            elif "city" in line:
                data[a]['city'] = lines[i + 1]
            elif "}" in line:
                a += 1
                data.append({})

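I know my code is terrible, but I'm a complete beginner with Python. As you can see, most of my project is done. What's left is to strip the code tags from the actual data. For example, I need string(15) "Ivor Abeysekera" to become Ivor Abeysekera.

After some research I considered .lstrip(), but the text in front of the value is different every time, so I got stuck.

Does anyone have a clever way of solving this problem? Cheers!

Edit: I am using Python 2.7 on Windows 7.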

score 1 · Accepted Answer

You should use regular expressions (regex) for this: http://docs.python.org/2/library/re.html

What you want to do can be done easily with the following code:

# Import the library
import re

# This is a string just to demonstrate
a = 'string(32) "PV Technical Sales Certification"'

# Create the regex
p = re.compile('[^"]+"(.*)"$')

# Find a match
m = p.match(a)

# Your result will now be in s
s = m.group(1)
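
As a rough sketch of how the same pattern might be applied to raw lines read from the file: the helper name `strip_tag` is hypothetical (it is not part of the answer above), and it assumes value lines keep the `string(N) "..."` shape shown in the question. `match` returns `None` for lines without a quoted value at the end, so the helper checks for that before calling `group`.

```python
import re

# Same pattern as above: skip everything up to the first quote,
# then capture everything up to the last quote on the line.
p = re.compile(r'[^"]+"(.*)"$')

def strip_tag(line):
    """Hypothetical helper: return the quoted value, or None if there is none."""
    m = p.match(line.strip())
    return m.group(1) if m else None

print(strip_tag('    string(15) "Ivor Abeysekera"\n'))
print(strip_tag('  array(14) {\n'))
```

Returning `None` instead of raising keeps the caller's loop simple: lines such as `array(14) {` or `["full_name"]=>` can just be skipped.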

Hope this helps!

score 0 · Accepted Answer

You can do this statefully by looping over all the lines and keeping track of where you are in a block:

# Map field names to dict keys
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

data = []
current = {}
key = None
with open( "mess.txt" ) as messy:
    for line in messy:
        line = line.lstrip()
        if line.startswith('}'):
            data.append(current)
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value
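
The quote-splitting expression can also be tried on its own. This is a small sketch using sample lines from the question; the function name `between_quotes` is made up for illustration. Splitting once from the left drops everything before the first quote, and splitting once from the right drops everything after the last one, so the length of the `string(N)` prefix does not matter.

```python
def between_quotes(line):
    """Illustrative helper: text between the first and last double quote."""
    return line.split('"', 1)[1].rsplit('"', 1)[0]

print(between_quotes('    string(15) "Ivor Abeysekera"\n'))
print(between_quotes('    string(21) "Zero Energy Homes LLC"\n'))
```

Note that this raises an IndexError on a line with no double quote at all, which is why the loop only applies it to lines that follow a recognised key.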

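The loop above removes the need to keep track of your position in the file, and it also means you can work on huge data files without loading everything into memory at once (provided you process each dict as soon as its record is complete). In fact, let's restructure it as a generator that processes one block of data at a time and yields a dict for you to work with: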

fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

def dict_maker(fileobj):
    current = {}
    key = None
    for line in fileobj:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)
        elif key is not None:
            # Get everything between the first and last quotes on the line
            value = line.split('"', 1)[1].rsplit('"', 1)[0]
            current[key] = value

with open("mess.txt") as messy:
    for d in dict_maker(messy):
        print(d)

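This makes your main loop tiny and easy to understand: you loop over a potentially enormous set of dicts one at a time and do something with them. It completely separates the act of building the dicts from the act of consuming them. And because the generator is stateful and only handles one line at a time, you can pass in anything that looks like a file, such as a list of strings, the output of a web request, or input from another program written to sys.stdin.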
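
As a minimal sketch of that last point, the generator can be fed a plain list of strings instead of a file. The mapping and generator are repeated from the answer so the example runs on its own; the `sample` list is a shortened record assembled from the question's data purely for illustration.

```python
# Mapping and generator repeated from the answer above.
fields = {
    'public_state': 'state',
    'public_zip': 'zip',
    'email': 'email',
    'full_name': 'contact',
    'org_name': 'name',
    'org_website': 'website',
    'city': 'city',
}

def dict_maker(lines):
    current = {}
    key = None
    for line in lines:
        line = line.lstrip()
        if line.startswith('}'):
            yield current
            current = {}
        elif line.startswith('['):
            keyname = line.split('"')[1]
            key = fields.get(keyname)  # None for fields we do not keep
        elif key is not None:
            current[key] = line.split('"', 1)[1].rsplit('"', 1)[0]

# Shortened sample record (illustrative): any iterable of lines works.
sample = [
    'array(14) {',
    '  ["ap_id"]=>',
    '  string(5) "22755"',
    '  ["public_state"]=>',
    '  string(2) "NY"',
    '  ["city"]=>',
    '  string(11) "Stony Brook"',
    '}',
]

records = list(dict_maker(sample))
print(records)
```

The unmapped `ap_id` field is skipped because `fields.get` returns `None` for it, so only `state` and `city` end up in the resulting dict.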
