regex - Notepad++: Extract all words containing a set of parentheses from a very long string

Question

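I have a large .txt file written in German. It is a transcript of many people talking. Where an abbreviated form of a word is used, the correct form is written in parentheses, either next to the word or inside it. I would like to extract every such instance in this .txt as a list. I have tried a few regular expressions, but I can't seem to highlight the whole "word".

Any ideas?

Here is part of the .txt, with the words I want to extract highlighted: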

Ich hab(e) am Achtundzwanzigsten achten neunzehnhundertneunzig Geburtstag. Also, wenn ich mich beschreiben sollte, dann muss ich sagen freundlich, unkompliziert und bescheiden. Hallo wie gehts (geht es) dir. Na was machst (machst du) den jetzt heut(e). Und, eh, hm, was noch? Stör(e) ich? Ja das is(t), eh, so, würd(e) ich das so sagen...

Thanks!

score 2 · Accepted Answer

If I have understood your needs correctly, how about:

(\w+\(\w+\))| \([\w\s]+\)

Explanation:

The regular expression:

(?-imsx:(\w+\(\w+\))| \([\w\s]+\))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \(                       '('
----------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \)                       ')'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
                           ' '
----------------------------------------------------------------------
  \(                       '('
----------------------------------------------------------------------
  [\w\s]+                  any character of: word characters (a-z, A-
                           Z, 0-9, _), whitespace (\n, \r, \t, \f,
                           and " ") (1 or more times (matching the
                           most amount possible))
----------------------------------------------------------------------
  \)                       ')'
----------------------------------------------------------------------
)                        end of grouping
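
To see what this pattern picks up, here is a minimal sketch that runs it through Python's `re` module on a shortened excerpt of the sample text. Python is only a stand-in for Notepad++ here, and the helper name `find_matches` is made up for illustration. Note that for the spaced form, the second alternative matches the leading space and the parenthesised group but not the word in front of it.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"(\w+\(\w+\))| \([\w\s]+\)")

def find_matches(text):
    """Return every full match of PATTERN in text, in order (hypothetical helper)."""
    return [m.group(0) for m in PATTERN.finditer(text)]

# Shortened excerpt of the sample text from the question.
sample = "Ich hab(e) Geburtstag. Hallo wie gehts (geht es) dir. Stör(e) ich?"
matches = find_matches(sample)
print(matches)
```

In Python 3, `\w` on a `str` pattern also matches letters such as `ö`, so `Stör(e)` is found as a whole; the `a-z, A-Z, 0-9, _` description in the table above reflects the tool that generated the explanation.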

score 0 · Accepted Answer

Notepad++ uses a regular expression flavor that may not be POSIX-compliant, so it does not support word boundaries (at least not in v5.9.2). Try the following regular expression:

[^\s]*\([^)]*\)[^\s\.\,\;\?\!]*

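[^\s]*: detects the start of the word by matching everything that is not whitespace (tab, space, etc.) in front of the parenthesis.
\([^)]*\): matches the parentheses and their contents.
[^\s\.\,\;\?\!]*: detects the end of the word by matching everything that is not whitespace or one of the listed punctuation marks.

You can extend this by adding further punctuation (quotation marks, for example) to the character classes before and after the word.
I tested this successfully on the sample text in Notepad++ v5.9.2.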
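
As a rough check of how the trailing character class behaves, the following sketch applies the same pattern with Python's `re` module to part of the sample text. This is an illustration outside Notepad++, and `find_words` is a hypothetical helper name. Trailing commas and periods are left out of the match, and for the spaced form only the parenthesised group itself is returned.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"[^\s]*\([^)]*\)[^\s\.\,\;\?\!]*")

def find_words(text):
    """Return all matches of PATTERN in text (hypothetical helper)."""
    return PATTERN.findall(text)

# Excerpt of the sample text from the question.
sample = "Ja das is(t), eh, so, würd(e) ich das so sagen. Hallo wie gehts (geht es) dir."
words = find_words(sample)
print(words)
```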

score 0 · Accepted Answer

This regular expression finds everything between ( and ), plus everything in front of the ( back to the preceding space character.

[^ ]*\([^)]*\)

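To turn the text into a nice list:

Open the Search/Replace dialog (Ctrl-H)
Find what: .*?([^ ]*\([^)]*\))
Replace with: \1\n
Select "Regular expression" and check ". matches newline"
Place the cursor at the start of the file (Ctrl-Home) and press "Replace All"
Ignore or delete the last line

You now have a nice list with each of these words on its own line.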
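
The replace steps above can be imitated outside Notepad++. The sketch below uses Python's `re.sub` with the same find and replace strings; `re.DOTALL` stands in for the ". matches newline" checkbox, and the function name `to_list` is an assumption made for this example. It also assumes the leftover text after the last match contains no line break.

```python
import re

FIND = r".*?([^ ]*\([^)]*\))"   # "Find what" from the steps above
REPLACE = r"\1\n"               # "Replace with"

def to_list(text):
    """Mimic "Replace All", then drop the unmatched tail (hypothetical helper)."""
    # re.DOTALL plays the role of the ". matches newline" checkbox.
    replaced = re.sub(FIND, REPLACE, text, flags=re.DOTALL)
    # Whatever follows the final "\n" is leftover text without a match;
    # this is the "last line" that the steps say to ignore or delete.
    return replaced.split("\n")[:-1]

# Shortened excerpt of the sample text from the question.
sample = "Ich hab(e) Geburtstag. Stör(e) ich? Ja das is(t), eh"
words = to_list(sample)
print(words)
```

If only the list is needed and the file does not have to be edited in place, calling `re.findall` with the shorter pattern `[^ ]*\([^)]*\)` should give the same words directly.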
