java - Splitting on comma outside quotes

Question

My program reads a line from a file. This line contains comma-separated text like:

123,test,444,"don't split, this",more test,1

I would like the result of the split to be this:

123
test
444
"don't split, this"
more test
1

If I use String.split(","), I get this:

123
test
444
"don't split
 this"
more test
1

In other words, the comma inside the substring "don't split, this" is not a separator. How do I deal with this?

score 158 · Accepted Answer

You can try this regex:

str.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

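This splits the string on every comma that is followed by an even number of double quotes. In other words, it splits on commas outside the double quotes. It works provided the quotes in your string are balanced.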
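As a quick check, here is a minimal, self-contained sketch that applies the regex above to the sample line from the question. The class name SplitOutsideQuotes and the helper split are made up for illustration; only String.split and the regex come from the answer.

```java
class SplitOutsideQuotes {
    // A comma followed by an even number of double quotes up to the end of the input.
    static final String COMMA_OUTSIDE_QUOTES = ",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper: splits one line on commas that lie outside double quotes.
    static String[] split(String line) {
        return line.split(COMMA_OUTSIDE_QUOTES);
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        // Prints six fields, one per line; the quoted field stays in one piece
        // and keeps its surrounding quotes.
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

Because the look-ahead scans to the end of the input for every comma, this is probably best suited to short lines rather than very large inputs.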

Explanation:

,           // Split on comma
(?=         // Followed by
   (?:      // Start a non-capture group
     [^"]*  // 0 or more non-quote characters
     "      // 1 quote
     [^"]*  // 0 or more non-quote characters
     "      // 1 quote
   )*       // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
   [^"]*    // Finally 0 or more non-quotes
   $        // Till the end  (This is necessary, else every comma will satisfy the condition)
)

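You can even write it like this in your code, using the (?x) modifier in your regex. The modifier ignores whitespace in the regex, so a regex broken across multiple lines like this is easier to read: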

String[] arr = str.split("(?x)   " + 
                     ",          " +   // Split on comma
                     "(?=        " +   // Followed by
                     "  (?:      " +   // Start a non-capture group
                     "    [^\"]* " +   // 0 or more non-quote characters
                     "    \"     " +   // 1 quote
                     "    [^\"]* " +   // 0 or more non-quote characters
                     "    \"     " +   // 1 quote
                     "  )*       " +   // 0 or more repetition of non-capture group (multiple of 2 quotes will be even)
                     "  [^\"]*   " +   // Finally 0 or more non-quotes
                     "  $        " +   // Till the end  (This is necessary, else every comma will satisfy the condition)
                     ")          "     // End look-ahead
                         );
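If the pattern is used for every line of a file, one option is to compile it once and pass the Pattern.COMMENTS flag instead of the inline (?x). The sketch below assumes that setup; the class name CommentedSplitPattern and the helper split are hypothetical. In comments mode the explanatory notes can live inside the regex itself after a #. The limit of -1 is an extra assumption on my part: it keeps trailing empty fields, which the default split silently drops.

```java
import java.util.regex.Pattern;

class CommentedSplitPattern {
    // Same look-ahead as above, compiled once. In COMMENTS mode, whitespace
    // and everything from '#' to the end of the line are ignored.
    static final Pattern COMMA_OUTSIDE_QUOTES = Pattern.compile(
            ",                            # split on a comma\n" +
            "(?=                          # followed by\n" +
            "  (?: [^\"]* \" [^\"]* \" )* # zero or more pairs of quotes\n" +
            "  [^\"]* $                   # then no further quotes up to the end\n" +
            ")",
            Pattern.COMMENTS);

    // Hypothetical helper: a limit of -1 keeps trailing empty fields.
    static String[] split(String line) {
        return COMMA_OUTSIDE_QUOTES.split(line, -1);
    }

    public static void main(String[] args) {
        // The two trailing commas produce two empty fields at the end.
        for (String field : split("a,\"b,c\",,")) {
            System.out.println("[" + field + "]");
        }
    }
}
```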

score 21 · Accepted Answer

Why split when you can match?

Resurrecting this question because, for some reason, the easy solution wasn't mentioned. Here is a beautifully compact regex:

"[^"]*"|[^,]+

This will match all the desired fragments (see demo).

Explanation

With "[^"]*", we match complete "double-quoted strings"
or |
we match [^,]+, any characters that are not a comma.
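In Java, the matching approach could look roughly like the sketch below, which collects every match with Matcher.find. The class name MatchFields and the helper fields are invented for this example. One caveat worth knowing: because [^,]+ needs at least one character, empty fields (as in a,,b) are skipped rather than returned as empty strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchFields {
    // Either a complete double-quoted string, or a run of non-comma characters.
    static final Pattern FIELD = Pattern.compile("\"[^\"]*\"|[^,]+");

    // Hypothetical helper: returns every fragment matched by FIELD, in order.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        // Prints six fragments; the quoted one keeps its quotes.
        fields(line).forEach(System.out::println);
    }
}
```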

A possible refinement would be to improve the string side of the alternation so that quoted strings can include escaped quotes.
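One way that refinement might look, assuming quotes are escaped with a backslash (\"): the answer does not say which escaping convention it has in mind, so this is only a sketch under that assumption. CSV files often double the quote ("") instead, which would need a different pattern. The names EscapedQuoteFields and fields are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteFields {
    // Quoted side: any mix of ordinary characters and backslash-escaped
    // characters between the quotes. Unquoted side: unchanged.
    // Assumption: quotes inside a field are escaped as \" .
    static final Pattern FIELD =
            Pattern.compile("\"(?:[^\"\\\\]|\\\\.)*\"|[^,]+");

    // Hypothetical helper: returns every fragment matched by FIELD, in order.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // The raw line is: 1,"say \"hi\", ok",2
        String line = "1,\"say \\\"hi\\\", ok\",2";
        // Prints three fragments; the middle one keeps its escaped quotes.
        fields(line).forEach(System.out::println);
    }
}
```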

score 2 · Accepted Answer

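You can do this very easily without a complex regular expression:

Split on the character ". You get a list of strings.
Process each string in the list: split every string at an even position in the list (indexing from zero) on "," (you get a list inside the list), and leave every string at an odd position alone (put it directly into a list inside the list).
Join the list of lists, so you end up with just a list.

If you want to handle quoting of '"' itself, you have to adapt the algorithm a little (rejoin some parts that were split incorrectly, or change the splitting to a simple regex), but the basic structure stays the same.

So basically it is something like this: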

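public class SplitTest {
    public static void main(String[] args) {
        final String splitMe = "123,test,444,\"don't split, this\",more test,1";
        // Split on the quote character. Even-indexed parts lie outside
        // quotes, odd-indexed parts lie inside them.
        final String[] splitByQuote = splitMe.split("\"");
        final String[][] splitByComma = new String[splitByQuote.length][];
        for (int i = 0; i < splitByQuote.length; i++) {
            String part = splitByQuote[i];
            if (i % 2 == 0) {
                // Outside quotes: split on commas. Note that a part starting
                // with a comma (such as ",more test,1") yields a leading
                // empty string.
                splitByComma[i] = part.split(",");
            } else {
                // Inside quotes: keep the text as a single field. The quotes
                // themselves were consumed by the first split.
                splitByComma[i] = new String[] { part };
            }
        }
        // Printing the nested arrays in order flattens them.
        for (String[] parts : splitByComma) {
            for (String part : parts) {
                System.out.println(part);
            }
        }
    }
}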

This will be much cleaner with lambdas, which we have been promised!
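With the lambdas and streams that arrived in Java 8, the same idea might be written as in the sketch below. This is one possible reading of the answer, not its author's code: the class name SplitWithStreams and the helper split are hypothetical. The filter step is an addition of mine that drops the empty string produced when a part begins with a comma, and it also drops genuinely empty fields, so remove it if those matter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

class SplitWithStreams {
    // Hypothetical helper: split on quotes, then split only the even-indexed
    // (outside-quote) parts on commas, and flatten everything into one list.
    static List<String> split(String line) {
        String[] byQuote = line.split("\"");
        return IntStream.range(0, byQuote.length)
                .mapToObj(i -> i % 2 == 0
                        ? byQuote[i].split(",")           // outside quotes
                        : new String[] { byQuote[i] })    // inside quotes, kept whole
                .flatMap(Arrays::stream)
                // Assumption: empty fragments are unwanted (this also removes
                // genuinely empty fields).
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String line = "123,test,444,\"don't split, this\",more test,1";
        // Prints six fields; unlike the regex answers, the quoted field
        // loses its surrounding quotes.
        split(line).forEach(System.out::println);
    }
}
```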
