java - How to match comments unless they are inside a quoted string?

Question

So I have some strings:

//Blah blah blach
// sdfkjlasdf
"Another //thing"

And I am using a Java regex to remove every line that has a double slash, like so:

theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");

It works for the most part, but the problem is that it removes every occurrence, and I need a way to keep it from removing the ones inside quotes. How can I do that?

score 4 · Accepted Answer

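Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only the parts you are interested in, you could use a third-party tool such as ANTLR.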

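ANTLR lets you define only the tokens you are interested in (plus, of course, the tokens that could otherwise mess up your token stream, such as multi-line comments and string and char literals). So you only need to define a lexer (another word for tokenizer) that handles those tokens correctly.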

This is called a grammar. In ANTLR, such a grammar looks like this:

lexer grammar FuzzyJavaLexer;

options{filter=true;}

SingleLineComment
  :  '//' ~( '\r' | '\n' )*
  ;

MultiLineComment
  :  '/*' .* '*/'
  ;

StringLiteral
  :  '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
  ;

CharLiteral
  :  '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
  ;

Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 and save it in the same folder as your FuzzyJavaLexer.g file.

Execute the following command:

java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g

which will create a FuzzyJavaLexer.java source class.

Of course, you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below into it:

import org.antlr.runtime.*;

public class FuzzyJavaLexerTest {
    public static void main(String[] args) throws Exception {
        String source = 
            "class Test {                                 \n"+
            "  String s = \" ... \\\" // no comment \";   \n"+
            "  /*                                         \n"+
            "   * also no comment: // foo                 \n"+
            "   */                                        \n"+
            "  char quote = '\"';                         \n"+
            "  // yes, a comment, finally!!!              \n"+
            "  int i = 0; // another comment              \n"+
            "}                                            \n";
        System.out.println("===== source =====");
        System.out.println(source);
        System.out.println("==================");
        ANTLRStringStream in = new ANTLRStringStream(source);
        FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        for(Object obj : tokens.getTokens()) {
            Token token = (Token)obj;
            if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
                System.out.println("Found a SingleLineComment on line "+token.getLine()+
                        ", starting at column "+token.getCharPositionInLine()+
                        ", text: "+token.getText());
            }
        }
    }
}

Next, compile FuzzyJavaLexer.java and FuzzyJavaLexerTest.java by doing:

javac -cp .:antlr-3.2.jar *.java

And finally, execute the FuzzyJavaLexerTest.class file:

// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest

or:

// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest

after which you will see the following printed to your console:

===== source =====
class Test {                                 
  String s = " ... \" // no comment ";   
  /*                                         
   * also no comment: // foo                 
   */                                        
  char quote = '"';                         
  // yes, a comment, finally!!!              
  int i = 0; // another comment              
}                                            

==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!              
Found a SingleLineComment on line 8, starting at column 13, text: // another comment

Easy, eh? :)

score 2 · Accepted Answer

Use a parser and decide character by character.

Kickoff example:

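// Kickoff sketch: `string` holds the whole input. Escaped quotes (\"),
// char literals ('"') and /* ... */ comments are not handled yet.
StringBuilder builder = new StringBuilder();
boolean quoted = false;

for (String line : string.split("\\n")) {
    for (int i = 0; i < line.length(); i++) {
        char c = line.charAt(i);
        if (c == '"') {
            quoted = !quoted; // entering or leaving a string literal
        }
        if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
            break; // start of a // comment: drop the rest of the line
        } else {
            builder.append(c);
        }
    }
    builder.append("\n");
}

String parsed = builder.toString();
System.out.println(parsed);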
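
The kickoff example can be extended into a small state machine. The following is only a sketch, not part of the original answer: the class name CommentStripper and its strip method are made up for illustration, and it assumes plain Java source without text blocks (""") or unicode escapes. It tracks five states so that backslash escapes, char literals and /* ... */ comments no longer confuse the scan.

```java
/** Hypothetical helper (not from the answer above): removes // and block comments. */
final class CommentStripper {

    private enum State { CODE, STRING, CHAR, LINE_COMMENT, BLOCK_COMMENT }

    static String strip(String source) {
        StringBuilder out = new StringBuilder(source.length());
        State state = State.CODE;
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            char next = i + 1 < source.length() ? source.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') { state = State.LINE_COMMENT; i += 2; continue; }
                    if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; i += 2; continue; }
                    if (c == '"') state = State.STRING;
                    else if (c == '\'') state = State.CHAR;
                    out.append(c);
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    if (c == '\\' && i + 1 < source.length()) {
                        out.append(next); // copy the escaped character verbatim
                        i += 2;
                        continue;
                    }
                    if ((state == State.STRING && c == '"') || (state == State.CHAR && c == '\'')) {
                        state = State.CODE; // closing delimiter of the literal
                    }
                    break;
                case LINE_COMMENT:
                    if (c == '\n' || c == '\r') { // the line terminator is not part of the comment
                        out.append(c);
                        state = State.CODE;
                    }
                    break;
                case BLOCK_COMMENT:
                    if (c == '*' && next == '/') { state = State.CODE; i += 2; continue; }
                    if (c == '\n') out.append(c); // keep line numbers stable
                    break;
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source =
            "String s = \" ... \\\" // no comment \"; // real comment\n" +
            "char quote = '\"'; /* also // gone */\n";
        System.out.print(strip(source));
    }
}
```

Note that a block comment is removed without leaving a space behind, so int/**/i would collapse to inti. Appending a single space when a block comment closes avoids that if it matters for your input.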

score 1 · Accepted Answer

The following is from a grep-like program I wrote (in Perl) a few years ago. It has an option to strip Java comments before processing the file:

# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file.  Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space .
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================

sub strip_java_comments
{
      s!(  (?: \" [^\"\\]*   (?:  \\.  [^\"\\]* )*  \" )
         | (?: \' [^\'\\]*   (?:  \\.  [^\'\\]* )*  \' )
         | (?: \/\/  [^\n] *)
         | (?: \/\*  .*? \*\/)
       )
       !
         my $x = $1;
         my $first = substr($x, 0, 1);
         if ($first eq '/')
         {
             "\n" x ($x =~ tr/\n//);
         }
         else
         {
             $x;
         }
       !esxg;
}

This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by Unicode escapes (\u0022 etc.), but you can easily deal with those first if you need to.
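
As an illustration of "dealing with those first", here is a rough Java sketch of a pre-pass that decodes Unicode escapes before any comment stripping runs. It is an assumption about how one might do it, not code from the original program: the class name UnicodeEscapes and the method decode are invented here. It follows the rule that a backslash only starts an escape when it is not itself escaped by a preceding backslash, and that several u characters may follow the backslash.

```java
/** Hypothetical pre-pass (not from the original program): decodes unicode escapes. */
final class UnicodeEscapes {

    static String decode(String source) {
        StringBuilder out = new StringBuilder(source.length());
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (c != '\\' || i + 1 >= source.length()) {
                out.append(c);
                i++;
                continue;
            }
            if (source.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < source.length() && source.charAt(j) == 'u') {
                    j++; // any number of 'u' characters may follow the backslash
                }
                int code = hex4(source, j);
                if (code >= 0) {
                    out.append((char) code);
                    i = j + 4;
                    continue;
                }
            }
            // Not a valid escape: copy the backslash together with the next
            // character, so a doubled backslash followed by 'u' stays as it is.
            out.append(c).append(source.charAt(i + 1));
            i += 2;
        }
        return out.toString();
    }

    /** Returns the value of four hex digits starting at start, or -1 if they are not there. */
    private static int hex4(String s, int start) {
        if (start + 4 > s.length()) return -1;
        int value = 0;
        for (int k = start; k < start + 4; k++) {
            char h = s.charAt(k);
            int digit = h >= '0' && h <= '9' ? h - '0'
                      : h >= 'a' && h <= 'f' ? h - 'a' + 10
                      : h >= 'A' && h <= 'F' ? h - 'A' + 10
                      : -1;
            if (digit < 0) return -1;
            value = value * 16 + digit;
        }
        return value;
    }

    public static void main(String[] args) {
        // The escape decodes to a double quote, which closes the string early.
        System.out.println(decode("String s = \"\\u0022 // now a real comment\";"));
    }
}
```

The decoded text no longer matches the input character for character, so this only suits cases where the stripped output does not have to preserve the original escapes.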

As it's Perl, not Java, the replacement code will have to change. I'll have a quick crack at producing equivalent Java. Stand by...

EDIT: I've just whipped this up. It will probably need work:

// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment within a quoted string
// or a (partial or full) quoted-string within a comment.
// (I may not have translated the back-slashes accurately.  You'll figure it out)

Pattern p = Pattern.compile(
       "(  (?: \" [^\"\\\\]*   (?:  \\\\.  [^\"\\\\]* )*  \" )" +  //    " ... "
       "  | (?: ' [^'\\\\]*    (?:  \\\\.  [^'\\\\]*  )*  '  )" +  // or ' ... '
       "  | (?: //  [^\\n] *    )" +                               // or // ...
       "  | (?: /\\*  .*? \\* / )" +                               // or /* ... */
       ")",
       Pattern.DOTALL  | Pattern.COMMENTS
);

Matcher m = p.matcher(entireInputFileAsAString);

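// Note: appendReplacement/appendTail accept a StringBuilder only since Java 9;
// use a StringBuffer on older versions.
StringBuilder output = new StringBuilder();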

while (m.find())
{
    if (m.group(1).startsWith("/"))
    {
        // This is a comment. Replace it with a space...
        m.appendReplacement(output, " ");

        // ... or replace it with an equivalent number of newlines
        // (exercise for reader)
    }
    else
    {
        // We matched a quoted string.  Put it back
        m.appendReplacement(output, "$1");
    }
}

m.appendTail(output);
return output.toString();
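
For reference, here is one way the snippet above could be wrapped into a runnable class, including the "equivalent number of newlines" exercise. Treat it as a sketch under assumptions rather than the author's final version: the class name RegexCommentStripper and the method strip are made up, every comment is replaced by the newlines it contained (or by a single space if it had none), and Matcher.quoteReplacement is used so that $ and backslashes inside string literals are copied back literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical wrapper around the pattern from the answer above. */
final class RegexCommentStripper {

    // Match quoted literals as well as comments, so a // inside a string is
    // consumed as part of the string and never seen as a comment start.
    private static final Pattern TOKEN = Pattern.compile(
          "(  (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )"   //    " ... "
        + " | (?: '  [^'\\\\]*  (?: \\\\. [^'\\\\]*  )* '  )"   // or ' ... '
        + " | (?: // [^\\n]* )"                                 // or // ...
        + " | (?: /\\* .*? \\*/ )"                              // or /* ... */
        + ")",
        Pattern.DOTALL | Pattern.COMMENTS);

    static String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer(); // StringBuffer also works before Java 9
        while (m.find()) {
            String token = m.group(1);
            String replacement;
            if (token.startsWith("/")) {
                // A comment: keep only its newlines so line numbers stay the same.
                String newlines = token.replaceAll("[^\\n]", "");
                replacement = newlines.isEmpty() ? " " : newlines;
            } else {
                // A string or char literal: put it back unchanged.
                replacement = token;
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String source =
            "String s = \"a // b\"; // c\n" +
            "/* x\n y */int i;\n";
        System.out.print(strip(source));
    }
}
```

Like the Perl original, this does not decode Unicode escapes, and unterminated literals or comments are simply left in place.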

score 1 · Accepted Answer

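(This is in response to the question @finnw asked in a comment under his answer. It is not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)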

Here is my test code:

String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";

String test = 
    "class Test {                                 \n"+
    "  String s = \" ... \\\" // no comment \";   \n"+
    "  /*                                         \n"+
    "   * also no comment: // but no harm         \n"+
    "   */                                        \n"+
    "  /* no comment: // much harm  */            \n"+
    "  char quote = '\"';  // comment             \n"+
    "  // another comment                         \n"+
    "  int i = 0; // and another                  \n"+
    "}                                            \n"
    .replaceAll(" +$", "");
System.out.printf("%n%s%n", test);

System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));

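r0 is the edited regex from your answer; it removes only the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.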

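r1 deals with the newline problem, but it still incorrectly matches // no comment inside a string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"), and you only used two of them to match the backslash in the second part.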

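r2 fixes that, but it makes no attempt to deal with the quote in the char literal, or with single-line comments inside multi-line comments. Those can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?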

score 0 · Accepted Answer

You can't tell with a regex whether you are inside a double-quoted string or not. In the end, a regex is just a state machine (sometimes extended a bit). I would use a parser, as provided by BalusC or this one.

If you want to know why regexes are limited, read about formal grammars. The Wikipedia article is a good start.
