python - Regex match when spaces are removed: how to remove the matched characters from the original string that still contains spaces?

Question

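(Disclaimer: this is my first Stack Overflow question, so please forgive me in advance if it is not very clear.)

Expected result:

My task is to find a company's legal identifier in a string representing the company name, then separate the two and store them in different strings. The company names have already been cleaned, so they contain only lowercase alphanumeric characters.

Example: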

company_1 = 'uber wien abcd gmbh'
company_2 = 'uber wien abcd g m b h'
company_3 = 'uber wien abcd ges mbh'

should result in:

company_1_name = 'uber wien abcd'
company_1_legal = 'gmbh'
company_2_name = 'uber wien abcd'
company_2_legal = 'gmbh'
company_3_name = 'uber wien abcd'
company_3_legal = 'gesmbh'

Where I am now:

I load the list of all company ids from a csv file. Austria provides a good example. Two legal ids there are:

gmbh
gesmbh

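I use a regex expression that tells me whether the company name contains a legal identifier. However, to identify the legal id, this regex is run against the string with all of its spaces removed: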

company_1_nospace = 'uberwienabcdgmbh'
company_2_nospace = 'uberwienabcdgmbh'
company_3_nospace = 'uberwienabcdgesmbh'

Since I look for the regex in the string without spaces, I can see that all three companies have a legal id in their name.
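As a rough illustration of the check described above (the helper name `find_legal_id` is made up for this sketch, and the pattern only covers the two Austrian ids):

```python
import re

# Both ids, anchored at the start or at the end of the space-free name.
LEGAL_RE = re.compile(r'^(?:gesmbh|gmbh)|(?:gesmbh|gmbh)$')

def find_legal_id(company_name):
    """Return the legal id found in the space-free name, or '' if there is none."""
    nospace = re.sub(r'\s+', '', company_name)
    match = LEGAL_RE.search(nospace)
    return match.group() if match else ''

for name in ('uber wien abcd gmbh',
             'uber wien abcd g m b h',
             'uber wien abcd ges mbh'):
    print(name, '->', find_legal_id(name))
```

This only detects the id; it says nothing yet about where the id sits in the original, spaced string.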

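Where I'm stuck:

I can tell whether there is a legal id in company_1, company_2 and company_3, but I can only remove it from company_1. I cannot remove g m b h because it does not actually match, even though I know it is a legal id. The only way I could remove it is to also remove the spaces in the rest of the company name, which I don't want to do (it would only be a last resort).

Even if I were to insert spaces into gmbh so that it matches g m b h, I would then not pick up ges mbh or ges m b h. (Note that the same thing happens for other countries.)

My code: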

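import re
company_name = 'uber wien abcd g m b h'  # sample input
re_code = re.compile(r'^gmbh|gmbh$|^gesmbh|gesmbh$')
# search the name with all whitespace removed
comp_id_re = re_code.search(re.sub(r'\s+', '', company_name))
if comp_id_re:
    company_id = comp_id_re.group()
    # this substitution runs on the spaced name, so 'g m b h' is not removed
    company_name = re.sub(re_code, '', company_name).strip()
else:
    company_id = ''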

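Is there a way for Python to understand which characters to remove from the original string? Or would it just be easier if somehow (that's another problem) I found all possible alternatives for the legal id spacing? That is, from gmbh I would create g mbh, gm bh, gmb h, g m bh, etc., and use those for matching/extraction?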
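One way to read the second idea, without enumerating every spacing by hand, is to allow optional whitespace between the letters of each id. The following is only a minimal sketch under the assumption that the ids are plain strings and must stand as a separate token at the start or end of the name; `spaced_pattern` and `split_company` are hypothetical names, not part of the code in this question:

```python
import re

def spaced_pattern(legal_id):
    """Build a pattern matching legal_id with optional whitespace between letters."""
    return r'\s*'.join(re.escape(ch) for ch in legal_id)

def split_company(name, legal_ids):
    """Return (name_without_id, legal_id); legal_id is '' when nothing matches."""
    # Try longer ids first so that 'gesmbh' is preferred over 'gmbh'.
    for legal_id in sorted(legal_ids, key=len, reverse=True):
        body = spaced_pattern(legal_id)
        # Either at the start, followed by a space or the end of the string,
        # or at the end, preceded by a space or the start of the string.
        pattern = re.compile(r'^' + body + r'(?!\S)|(?<!\S)' + body + r'$')
        match = pattern.search(name)
        if match:
            rest = name[:match.start()] + name[match.end():]
            return rest.strip(), legal_id
    return name, ''

for name in ('uber wien abcd gmbh',
             'uber wien abcd g m b h',
             'uber wien abcd ges mbh'):
    print(split_company(name, ['gmbh', 'gesmbh']))
```

Because the pattern runs on the original string, the spaces in the rest of the name are never touched.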

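I hope my explanation is clear enough. Coming up with a title for this was pretty hard.

UPDATE 1: Company ids are usually at the end of the company name string. In some countries they can appear at the beginning.

UPDATE 2: I think this takes care of the company ids inside the company name. It works for legal ids at the end of the company name, but not for company ids at the beginning.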

legal_regex = '^ltd|ltd$|^gmbh|gmbh$|^gesmbh|gesmbh$'
def foo(name, legal_regex):
    #compile regex that matches company ids at beginning/end of string
    re_code = re.compile(legal_regex)
    #remove spaces
    name_stream = name.replace(' ','')
    #find regex matches for legal ids
    comp_id_re = re_code.search(name_stream)
    #save company_id, remove it from string
    if comp_id_re:
        company_id = comp_id_re.group()
        name_stream = re.sub(re_code, '', name_stream).strip()
    else:
        company_id = ''
    #restore spaced string (only works if id is at the end)
    #next() gets a default of '' so that running out of characters
    #does not raise RuntimeError inside the generator on Python 3.7+
    name_stream_it = iter(name_stream)
    company_name = ''.join(next(name_stream_it, '') if e != ' ' else ' ' for e in name)
    return (company_name, company_id)

score 0 · Accepted Answer

I think I have arrived at an acceptable solution. I used part of my original code, part of the code from @Abhijit, and the main idea behind the code from @wei2912. Thank you all!

This is the code I'm going to use:

import re

legal_ids = '^ltd|ltd$|^gmbh|gmbh$|^gesmbh|gesmbh$'

def foo(name, legal_ids):
    #initialize re (company id at beginning or end of string)
    re_code = re.compile(legal_ids)
    #remove spaces from name
    name_stream = name.replace(' ','')
    #search for matches
    comp_id_re = re_code.search(name_stream)
    if comp_id_re:
        #match was found, extract the matching company id
        company_id = comp_id_re.group()
        #remove the id from the string without spaces
        name_stream = re.sub(re_code, '', name_stream).strip()
        #next() below gets a default of '' so that an exhausted iterator
        #does not raise RuntimeError inside the generator on Python 3.7+
        if comp_id_re.start()>0:
            #the legal id was NOT at the beginning of the string, proceed normally
            name_stream_it = iter(name_stream)
            final_name = ''.join(next(name_stream_it, '') if e != ' ' else ' ' for e in name)
        else:
            #the legal id was at the beginning of the string, so do the same as above, but with the reversed strings
            name_stream_it = iter(name_stream[::-1])
            final_name = ''.join(next(name_stream_it, '') if e != ' ' else ' ' for e in name[::-1])
            #reverse the string to get it back to normal
            final_name = final_name[::-1]
    else:
        company_id = ''
        final_name = name
    return (final_name.strip(), company_id)
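For comparison, here is a different sketch of the same idea (search the space-free string, then go back to the spaced one). It is not the code above: it records where each non-space character came from, so the match can be cut directly out of the original string. The function name `split_by_index_map` is made up for this example, and it assumes the regex never produces an empty match:

```python
import re

def split_by_index_map(name, legal_regex):
    """Search the space-free name, then cut the matched span out of the original."""
    # positions[i] is the index in `name` of the i-th non-space character.
    positions = [i for i, ch in enumerate(name) if ch != ' ']
    stripped = ''.join(name[i] for i in positions)
    match = re.search(legal_regex, stripped)
    if not match:
        return name.strip(), ''
    # Translate the match boundaries back to indices in the original string.
    start = positions[match.start()]
    end = positions[match.end() - 1] + 1
    remainder = name[:start] + name[end:]
    return remainder.strip(), match.group()

legal_regex = '^ltd|ltd$|^gmbh|gmbh$|^gesmbh|gesmbh$'
print(split_by_index_map('uber wien abcd g m b h', legal_regex))
print(split_by_index_map('ltd foo bar', legal_regex))
```

The index list makes the start and end cases symmetric, so no string reversal is needed.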

score 0 · Accepted Answer

Here's the code I came up with:

company_1 = 'uber wien abcd gmbh'
company_2 = 'uber wien abcd g m b h'
company_3 = 'uber wien abcd ges mbh'
legalids = ["gmbh", "gesmbh"]

def info(company, legalids):
    for legalid in legalids:
        found = []

        last_pos = len(company)-1
        pos = len(legalid)-1
        while True:
            if len(legalid) == len(found):
                newfound = found
                newfound.reverse()
                if legalid == ''.join(newfound):
                    return [company[:last_pos+1].strip(' '), legalid]
                else:
                    break

            if company[last_pos] == ' ':
                last_pos -= 1
                continue
            elif company[last_pos] == legalid[pos]:
                found.append(company[last_pos])
                pos -= 1
            else:
                break
            last_pos -= 1
    return

print(info(company_1, legalids))
print(info(company_2, legalids))
print(info(company_3, legalids))

Output:

['uber wien abcd', 'gmbh']
['uber wien abcd', 'gmbh']
['uber wien abcd', 'gesmbh']
