python - Case-insensitive regular expression replacement from a dictionary

Question

Sorry, but I couldn't find a working solution among the ones Google turned up. A few "recipes" on some sites came fairly close, but they were quite old, and I haven't found anything that gives me the result I'm looking for.

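I'm renaming files, so I have a function that spits out a filename; here I'm using 'test_string' to stand in for it. First, all dots (and underscores) and the like are removed, since those are the most common. Every professor does this differently, and it's impossible to deal with (or even look at) the names without stripping all of that out. Five examples: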

test_string_1 = 'legal.studies.131.race.relations.in.the.United.States.'

'legal.studies' -> 'Legal Studies'

test_string_2 = 'mediastudies the triumph of bluray over hddvd'

'mediastudies' -> 'Media Studies', 'bluray' -> 'Blu-ray', 'hddvd' -> 'HD DVD'

test_string_3 = 'computer Science Microsoft vs unix'

'computer Science' -> 'Computer Science', 'unix' -> 'UNIX'

test_string_4 = 'Perception - metamers dts'

'Perception' would already be fine (but who cares); the bigger point is that they want to keep the audio information in there, so 'dts' -> 'DTS'

test_string_5 = 'Perception - Cue Integration - flashing dot example aac20 xvid'

'aac20' -> 'AAC2.0', 'xvid' -> 'XviD'

Currently I'm doing this with something like:

new_string = re.sub(r'(?i)Legal(\s|-|)Studies', 'Legal Studies', re.sub(r'(?i)Sociology', 'Sociology', re.sub(r'(?i)Media(\s|-|)Studies', 'Media Studies', re.sub(r'(?i)UNIX', 'UNIX', re.sub(r'(?i)Blu(\s|-|)ray', 'Blu-ray', re.sub(r'(?i)HD(\s|-|)DVD', 'HD DVD', re.sub(r'(?i)xvid(\s|-|)', 'XviD', re.sub(r'(?i)aac(\s|-|)2(\s|-|\.|)0', 'AAC2.0', re.sub(r'(?i)dts', 'DTS', re.sub(r'\.', r' ', original_string.title()))))))))))

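I keep them all on one line. I don't change or update it very often, and (the way my brain/ADD works) it's easier to keep it as minimal and out of the way as possible while I'm doing other things, so that I don't have to mess with this part any more.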

So, for my examples:

new_test_string_1 = 'Legal Studies 131 Race Relations In The United States'
new_test_string_2 = 'Media Studies The Triumph Of Blu-ray Over HD DVD'
new_test_string_3 = 'Computer Science Microsoft Vs UNIX'
new_test_string_4 = 'Perception - Metamers DTS'
new_test_string_5 = 'Perception - Cue Integration - Flashing Dot Example AAC2.0 XviD'

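But as I accumulate more and more of these, it's really starting to look like something I'd want a dictionary (or similar) for. I don't want to blow the code up into something crazy, but I do want to be able to add new replacements as real-world examples come up (for instance, there are lots of audio codecs, containers, and so on that I'll apparently need to throw in). I have no opinion on the approach used for this master list/dictionary/whatever.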

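The big picture: I'm fixing spaces and underscores in filenames and replacing a bunch of crap with properly capitalized versions. Currently I title-case everything across the board, apart from the terms handled by my re.sub calls. The input's capitalization isn't perfect, and it may or may not contain the spaces, dashes, or dots that the output needs.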

Also, a one-liner unnamed (lambda or similar) function would be preferable.

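P.S. Sorry for some of the weirdness and the initial lack of clarity. One of the issues here is my major/studies: most of the material is actually pretty simple, but other classes need all the Blu-ray, HD DVD, DTS, AAC2.0, XviD, and so on.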

score 2 · Accepted Answer

>>> import re
>>> def string_fix(text,substitutions):
        text_no_dots = text.replace('.',' ').strip()
        for key,substitution in substitutions.items():
            text_no_dots = re.sub(key,substitution,text_no_dots,flags=re.IGNORECASE)
        return text_no_dots

>>> teststring = 'legal.studies.131.race.relations.in.the.U.S.'
>>> d = {
     r'Legal(\s|-|)Studies' : 'Legal Studies', 
     r'Sociology'           : 'Sociology', 
     r'Media(\s|-|)Studies' : 'Media Studies'
}
>>> string_fix(teststring,d)
'Legal Studies 131 race relations in the U S'

And here is a much better way to do it without a dictionary:

>>> teststring = 'legal.studies.131.race.relations.in.the.U.S.'
>>> def repl(match):
        return ' '.join(re.findall(r'\w+',match.group())).title()

>>> re.sub(r'Legal(\s|-|)Studies|Sociology|Media(\s|-|)Studies',repl,teststring.replace('.',' ').strip(),flags=re.IGNORECASE)
'Legal Studies 131 race relations in the U S'
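The two ideas above (a dictionary of replacements and a single alternation with a callback) can probably be combined. The following is only a sketch under some assumptions: the names `REPLACEMENTS`, `CANONICAL`, `COMBINED`, and `fix_name` are made up for illustration, each pattern fragment avoids capturing groups of its own so that `match.lastindex` identifies which alternative matched, and the text is title-cased first so the output resembles what the question expects.

```python
import re

# Hypothetical replacement table: regex fragment -> canonical spelling.
# Fragments contain no capturing groups, so group numbering stays predictable.
REPLACEMENTS = {
    r'legal[\s\-_]?studies': 'Legal Studies',
    r'media[\s\-_]?studies': 'Media Studies',
    r'blu[\s\-_]?ray': 'Blu-ray',
    r'hd[\s\-_]?dvd': 'HD DVD',
    r'aac[\s\-_]?2\.?0': 'AAC2.0',
    r'xvid': 'XviD',
    r'dts': 'DTS',
    r'unix': 'UNIX',
}

# Canonical spellings in the same order as the alternatives below.
CANONICAL = list(REPLACEMENTS.values())

# One alternation; each fragment is wrapped in its own capturing group.
COMBINED = re.compile(
    '|'.join('({})'.format(fragment) for fragment in REPLACEMENTS),
    flags=re.IGNORECASE,
)

def fix_name(text):
    # Turn dots and underscores into spaces, trim, then title-case everything.
    cleaned = re.sub(r'[._]+', ' ', text).strip().title()
    # lastindex is the number of the group that matched (1-based).
    return COMBINED.sub(lambda m: CANONICAL[m.lastindex - 1], cleaned)

print(fix_name('mediastudies the triumph of bluray over hddvd'))
```

Adding a new term is then a one-line change to the table. Note that short fragments such as `dts` are unanchored here, so wrapping them in `\b` may be worth considering if they could occur inside longer words.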

score 1 · Accepted Answer

import re

def string_fix(filename, substitutions):
    # Dots act as word separators in these filenames.
    filename = filename.replace('.', ' ')
    # Apply every pattern in turn, ignoring case.
    for pattern, replacement in substitutions.items():
        filename = re.sub(pattern, replacement, filename, flags=re.IGNORECASE)
    return filename

# Pattern -> canonical spelling; [\s\-_]? allows one optional separator.
# (Named 'substitutions' to avoid shadowing the built-in dict.)
substitutions = {
         r'Legal[\s\-_]?Studies' : 'Legal Studies',
         r'Media[\s\-_]?Studies' : 'Media Studies',
         r'dts' : 'DTS',
         r'hd[\s\-_]?dvd': 'HD DVD',
         r'blu[\s\-_]?ray' : 'Blu-ray',
         r'unix' : 'UNIX',
         r'aac[\s\-_]?2\.?0' : 'AAC2.0',
         r'xvid' : 'XviD',
         r'computer[\s\-_]?science' : 'Computer Science'
     }

string_1 = 'legal.studies.131.race.relations.in.the.United.States.'
string_2 = 'mediastudies the triumph of bluray over hddvd'
string_3 = 'computer Science Microsoft vs unix'
string_4 = 'Perception - metamers dts'
string_5 = 'Perception - Cue Integration - flashing dot example aac20 xvid'

print(string_fix(string_1, substitutions))
print(string_fix(string_2, substitutions))
print(string_fix(string_3, substitutions))
print(string_fix(string_4, substitutions))
print(string_fix(string_5, substitutions))
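Since the question mentions wanting to add new terms without much effort, one possible variation is to list only the canonical spellings and derive the patterns from them. This is a rough sketch, not something from the original answers: `CANONICAL_NAMES`, `pattern_for`, `TABLE`, and `normalise` are hypothetical names, and it assumes that wherever the canonical spelling has a space, dash, underscore, or dot, the input may have one such separator or none at all.

```python
import re

# Only the canonical spellings are maintained by hand (illustrative list).
CANONICAL_NAMES = ['Legal Studies', 'Media Studies', 'Computer Science',
                   'Blu-ray', 'HD DVD', 'AAC2.0', 'XviD', 'DTS', 'UNIX']

# At most one separator character may appear where the name is split.
OPTIONAL_SEPARATOR = r'[\s\-_.]?'

def pattern_for(name):
    # Split on separators, e.g. 'HD DVD' -> ['HD', 'DVD'],
    # then rejoin the escaped parts with an optional separator.
    parts = re.split(r'[\s\-_.]+', name)
    return OPTIONAL_SEPARATOR.join(re.escape(part) for part in parts)

# Precompiled (pattern, canonical spelling) pairs, matched case-insensitively.
TABLE = [(re.compile(pattern_for(name), re.IGNORECASE), name)
         for name in CANONICAL_NAMES]

def normalise(text):
    # Apply each derived pattern in turn, as in the loop-based answer.
    for pattern, name in TABLE:
        text = pattern.sub(name, text)
    return text

print(normalise('mediastudies the triumph of bluray over hddvd'))
```

The trade-off is less control per term: for example, this derived pattern for 'AAC2.0' accepts 'aac20' and 'aac2.0' but not 'aac 2.0', which the hand-written pattern in the question does allow.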
