java - How to discard certain parts of a CSV file using Java regex

Question

I have a CSV file like this one that I need to parse in Java.

2012-11-01 00,  1106,   2194.1971066908
2012-11-01 01,  760,    1271.8460526316
.
.
.
2012-11-30 21,  1353,   1464.0014781966
2012-11-30 22,  1810,   1338.8331491713
2012-11-30 23,  1537,   1222.7826935589
        
720 rows selected.      
        
Elapsed: 00:37:00.23

This is the Java code I wrote to split each column and store it in a list.

public void extractFile(String fileName){
        try{
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            try {
                String readBuff = bf.readLine();
                
                while (readBuff!=null){
                    
                    Pattern checkData = Pattern.compile("[a-zA-Z]");
                    Matcher match = checkData.matcher(readBuff);
                    
                    if (match.find()){
                        readBuff = null;
                    }
                    
                    else if (!match.find()){
                        
                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");
                        
                            for (int x=0; x<splitReadBuffByComma.length; x++){
                                
                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }
                    
                    readBuff = bf.readLine();
                }
            }
            finally{
                bf.close();
            }
        }
        catch(FileNotFoundException e){
            System.out.println("File not found dude: "+ e);
        }
        catch(IOException e){
            System.out.println("Error Exception dude: "+e);
        }
    }

The problem is that the regex I wrote is incomplete: the text "720 rows selected." still gets through and ends up in dHourList. dHourList should only store the date column, which looks like "2012-11-01 00", and so on.

What is the correct regex for this?

Update

2012-11-30 21
2012-11-30 22
2012-11-30 23

720 rows selected.

Elapsed: 00:37:00.23

Size of date-hour: 724 size of throughput: 720 size of avg resp time: 720

I switched the checkData regex to the one below. The backslash in \d is doubled because a single backslash produces an "invalid escape sequence" error at compile time.

Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$");

But "720 rows selected." still shows up, along with other lines that should not be there.

Update 2

Working code:

while (readBuff!=null){
                    
                    
                    Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
                    
                    Matcher match = checkData.matcher(readBuff);
                    
                    if (!match.find()){
                        readBuff = null;
                    }
                    
                    else{
                        
                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");
                        
                            for (int x=0; x<splitReadBuffByComma.length; x++){
                                
                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }
                    
                    readBuff = bf.readLine();
                }

After removing the else if condition, changing it to a plain else, and using the regex Cylian suggested, I got this output:

2012-11-30 21
2012-11-30 22
2012-11-30 23

Size of date-hour: 720 size of throughput: 720 size of avg resp time: 720

Thank you very much!

score 1 · Accepted Answer

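First, it makes sense to put a ^ at the beginning of your checkData regex. The expression is then tried only at the start of the line instead of being searched for throughout the whole string, which is faster.

You can also start the regex with something closer to the date format (four digits and a dash, for example). On a line like the last one, the row count is never followed by a dash.

Maybe something like this: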

Pattern checkData = Pattern.compile("^\\d\\d\\d\\d-");

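If you are sure you will never get unexpected data, this is enough. If you want the program to keep working even with malformed CSV data, extend the regex to cover the whole line and use matches() instead of find().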
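A minimal sketch of that stricter variant, assuming every data row has the shape "yyyy-MM-dd HH, integer, decimal" seen in the sample. The class name FullLineMatch, the method parse and the exact pattern are illustrative choices, not something from the question.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FullLineMatch {
    // Hypothetical full-line pattern: date and hour, an integer, a decimal number.
    private static final Pattern DATA_LINE = Pattern.compile(
            "(\\d{4}-\\d{2}-\\d{2} \\d{2})\\s*,\\s*(\\d+)\\s*,\\s*(\\d+(?:\\.\\d+)?)\\s*");

    /** Returns the three columns without padding, or empty if the line is not a data row. */
    static Optional<String[]> parse(String line) {
        Matcher m = DATA_LINE.matcher(line);
        if (!m.matches()) { // matches() requires the entire line to fit the pattern
            return Optional.empty();
        }
        return Optional.of(new String[] { m.group(1), m.group(2), m.group(3) });
    }

    public static void main(String[] args) {
        String[] sample = {
            "2012-11-01 00,  1106,   2194.1971066908",
            "720 rows selected.      ",
            "Elapsed: 00:37:00.23",
            "        "
        };
        for (String line : sample) {
            System.out.println(parse(line).map(c -> String.join("|", c)).orElse("skipped"));
        }
    }
}
```

Because the capture groups already exclude the padding, the values can go straight into the three lists without a separate split(",").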

score 1 · Accepted Answer

Try this [your code, slightly modified]:

public void extractFile(String fileName){
        try{
            BufferedReader bf = new BufferedReader(new FileReader(fileName));
            try {
                String readBuff = bf.readLine();

                while (readBuff!=null){

                    Pattern checkData = Pattern.compile("^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");
                    Matcher match = checkData.matcher(readBuff);

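                    if (!match.find()){
                        readBuff = null;
                    }

                    // Plain else: a second find() would look for another match
                    // after the first one and always fail for this pattern.
                    else {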

                        String[] splitReadBuffByComma = new String[3];
                        splitReadBuffByComma = readBuff.split(",");

                            for (int x=0; x<splitReadBuffByComma.length; x++){

                                if (x==0){
                                    dHourList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==1){
                                    throughputList.add(splitReadBuffByComma[x]);
                                }
                                else if (x==2){
                                    avgRespTimeList.add(splitReadBuffByComma[x]);
                                }
                            }
                    }

                    readBuff = bf.readLine();
                }
            }
            finally{
                bf.close();
            }
        }
        catch(FileNotFoundException e){
            System.out.println("File not found dude: "+ e);
        }
        catch(IOException e){
            System.out.println("Error Exception dude: "+e);
        }
    }
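One thing to keep in mind when wrapping find() in an if/else chain: a Matcher is stateful, and each find() call resumes after the previous match. The sketch below (the class name FindTwice is made up for illustration) shows that with this pattern a second find() on the same line returns false, so the result of find() should be evaluated once per line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FindTwice {
    static final Pattern CHECK_DATA = Pattern.compile(
            "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    /** Calls find() twice on the same Matcher and returns both results. */
    static boolean[] findTwice(String line) {
        Matcher m = CHECK_DATA.matcher(line);
        boolean first = m.find();  // matches the whole data row
        boolean second = m.find(); // resumes at the end of the line, so ^ cannot match
        return new boolean[] { first, second };
    }

    public static void main(String[] args) {
        boolean[] r = findTwice("2012-11-30 21,  1353,   1464.0014781966");
        System.out.println(r[0] + " " + r[1]); // prints "true false"
    }
}
```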

Regex anatomy

# ^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$
# 
# Options: ^ and $ match at line breaks
# 
# Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
# Match the regular expression below and capture its match into backreference number 1 «(19|20)»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «19»
#       Match the characters “19” literally «19»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «20»
#       Match the characters “20” literally «20»
# Match a single digit 0..9 «\d»
# Match a single digit 0..9 «\d»
# Match the regular expression below and capture its match into backreference number 2 «([-/.])»
#    Match a single character present in the list “-/.” «[-/.]»
# Match the regular expression below and capture its match into backreference number 3 «(0[1-9]|1[012])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (the entire group fails if this one fails to match) «1[012]»
#       Match the character “1” literally «1»
#       Match a single character present in the list “012” «[012]»
# Match the same text as most recently matched by capturing group number 2 «\2»
# Match the regular expression below and capture its match into backreference number 4 «(0[1-9]|[12][0-9]|3[01])»
#    Match either the regular expression below (attempting the next alternative only if this one fails) «0[1-9]»
#       Match the character “0” literally «0»
#       Match a single character in the range between “1” and “9” «[1-9]»
#    Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[12][0-9]»
#       Match a single character present in the list “12” «[12]»
#       Match a single character in the range between “0” and “9” «[0-9]»
#    Or match regular expression number 3 below (the entire group fails if this one fails to match) «3[01]»
#       Match the character “3” literally «3»
#       Match a single character present in the list “01” «[01]»
# Assert position at a word boundary «\b»
# Match any single character that is not a line break character «.+»
#    Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at the end of a line (at the end of the string or before a line break character) «$»

Update

As far as I understand, your input contains many lines that start with a date but do not contain commas. To handle that, change the previous pattern to this:

^(19|20)\d\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\s+\d+,[^,]+,[^,]+$

Or, escaped for a Java string literal:

^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\s+\\d+,[^,]+,[^,]+$
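The escaping matters for every backslash, not only for \d. In a Java string literal "\2" is an octal escape (the character U+0002) and "\b" is a backspace, so both compile without error but the regex engine never sees a backreference or a word boundary. A small sketch of the difference, with the class name EscapeDemo made up for illustration:

```java
import java.util.regex.Pattern;

final class EscapeDemo {
    // Single backslashes before 2 and b: Java turns them into control
    // characters before the pattern is ever compiled.
    static final Pattern SINGLE = Pattern.compile(
            "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\2(0[1-9]|[12][0-9]|3[01])\b.+$");

    // Doubled backslashes: the regex engine receives \2 and \b as intended.
    static final Pattern DOUBLED = Pattern.compile(
            "^(19|20)\\d\\d([-/.])(0[1-9]|1[012])\\2(0[1-9]|[12][0-9]|3[01])\\b.+$");

    static boolean found(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        String row = "2012-11-30 21,  1353,   1464.0014781966";
        System.out.println(found(SINGLE, row));                        // false
        System.out.println(found(DOUBLED, row));                       // true
        System.out.println(found(DOUBLED, "720 rows selected.      ")); // false
    }
}
```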

score 1 · Accepted Answer

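You don't need a regex for this (if your data really looks like the example you showed).

You can check any one of the following:

whether the line contains a comma ",", or
whether the split array has length 3, or
change the while condition slightly so that the loop breaks when a line ends with "selected."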
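A rough sketch of the second check, assuming data rows are the only lines that split into exactly three comma-separated fields. The class SplitFilter and the method dataRows are hypothetical names, and the trim() call is an addition to strip the padding seen in the sample.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitFilter {
    /** Keeps only lines that split into exactly three comma-separated fields. */
    static List<String[]> dataRows(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (cols.length != 3) {
                continue; // blank lines, "720 rows selected." and "Elapsed: ..." end up here
            }
            for (int i = 0; i < cols.length; i++) {
                cols[i] = cols[i].trim(); // drop the padding around each value
            }
            rows.add(cols);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2012-11-30 23,  1537,   1222.7826935589",
                "        ",
                "720 rows selected.      ",
                "Elapsed: 00:37:00.23");
        for (String[] row : dataRows(lines)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

This is less strict than the regex approach: any line with exactly two commas passes, whatever its content.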
