antlr - How can I keep concatenated tokens separate during lexing when a more general token is available?

Question

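In the language I'm working on, certain tokens can be written stuck together (for example "intfloat"). I'm looking for a way to stop the lexer from turning them into an ID, so that they are available as separate tokens at parse time. The simplest grammar that demonstrates this is (WS omitted):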

B: 'b';
C: 'c';
ID: ('a'..'z')+;
doc : (B | C | ID)* EOF;

Running it on:

bc
abc
bcd

What I'd like from the lexer:

B C
ID (starts with not-a-keyword so it's an ID)
<error> (cannot concat non-keywords)

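But, as expected, I get three IDs.

I've been considering making ID non-greedy, which would degenerate it into a separate token for each character. I suppose I could glue them back together afterwards, but there must be a better way.

Any thoughts?

Thanks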

score 2 · Accepted Answer

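Here's a start toward a solution. It uses the lexer to split the text into tokens. The trick is that rule ID can emit more than one token per invocation. This is non-standard lexer behavior, so there are a few caveats: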

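I'm fairly sure this won't work in ANTLR 4.
This code assumes that every token is queued in tokenQueue.
Rule ID doesn't prevent a keyword from repeating, so intintint produces the tokens INT INT INT. If that's a problem, handle it on either the lexer side or the parser side, whichever makes more sense for your grammar.
The shorter the keywords, the more brittle this solution is. The input internal is an invalid ID because it starts with the keyword int and is followed by a non-keyword string.
The grammar produces warnings that I haven't cleaned up. If you use this code, I'd recommend getting rid of them.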
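The repeated-keyword caveat could be handled after lexing. The following is only a sketch in plain Java, not part of the grammar: Tok is a hypothetical stand-in for a lexer token (type name plus inclusive start and stop offsets), and the keyword names INT and FLOAT are taken from the grammar in this answer. It reports the first keyword that directly repeats the keyword before it with no characters in between, so "intintint" is flagged while "int int" is not.

```java
import java.util.List;

final class RepeatedKeywordCheck {
    /** Hypothetical stand-in for a lexer token: type name and inclusive offsets. */
    static final class Tok {
        final String type;
        final int start;
        final int stop;

        Tok(String type, int start, int stop) {
            this.type = type;
            this.start = start;
            this.stop = stop;
        }
    }

    /**
     * Returns the index of the first keyword token that repeats the token
     * right before it with no gap between them, or -1 if there is none.
     */
    static int firstPackedRepeat(List<Tok> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            Tok prev = tokens.get(i - 1);
            Tok cur = tokens.get(i);
            boolean keyword = cur.type.equals("INT") || cur.type.equals("FLOAT");
            boolean sameType = cur.type.equals(prev.type);
            boolean adjacent = cur.start == prev.stop + 1; // no whitespace in between
            if (keyword && sameType && adjacent) {
                return i;
            }
        }
        return -1;
    }
}
```

Whether a repeat should be an error at all depends on the language being parsed; the same check could equally live in a parser action.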

Here's the grammar:

MultiToken.g

grammar MultiToken;


@lexer::members{
    private java.util.LinkedList<Token> tokenQueue = new java.util.LinkedList<Token>();

    @Override
    public Token nextToken() {
            Token t = super.nextToken();
            if (tokenQueue.isEmpty()){
                if (t.getType() == Token.EOF){
                    return t;
                } else { 
                    throw new IllegalStateException("All tokens must be queued!");
                }
            } else { 
                return tokenQueue.removeFirst();
            }
    }

    public void emit(int ttype, int tokenIndex) {
        //This is lifted from ANTLR's Lexer class, 
        //but modified to handle queueing and multiple tokens per rule.
        Token t;

        if (tokenIndex > 0){
            CommonToken last = (CommonToken) tokenQueue.getLast();
            t = new CommonToken(input, ttype, state.channel, last.getStopIndex() + 1, getCharIndex() - 1);
        } else { 
            t = new CommonToken(input, ttype, state.channel, state.tokenStartCharIndex, getCharIndex() - 1);
        }

        t.setLine(state.tokenStartLine);
        t.setText(state.text);
        t.setCharPositionInLine(state.tokenStartCharPositionInLine);
        emit(t);
    }

    @Override
    public void emit(Token t){
        super.emit(t);
        tokenQueue.addLast(t);
    }
}

doc     : (INT | FLOAT | ID | NUMBER)* EOF;

fragment
INT     : 'int';

fragment
FLOAT   : 'float';

NUMBER  : ('0'..'9')+;

ID  
@init {
    int index = 0; 
    boolean rawId = false;
    boolean keyword = false;
}
        : ({!rawId}? INT {emit(INT, index++); keyword = true;}
            | {!rawId}? FLOAT {emit(FLOAT, index++); keyword = true;}
            | {!keyword}? ('a'..'z')+ {emit(ID, index++); rawId = true;} 
          )+
        ;

WS      : (' '|'\t'|'\f'|'\r'|'\n')+ {skip();};

Test case 1: mixed keywords

Input

intfloat a
int b
float c
intfloatintfloat d

Output (tokens)

[INT : int] [FLOAT : float] [ID : a] 
[INT : int] [ID : b]
[FLOAT : float] [ID : c] 
[INT : int] [FLOAT : float] [INT : int] [FLOAT : float] [ID : d]

Test case 2: IDs containing keywords

Input

aintfloat
bint
cfloat
dintfloatintfloat

Output (tokens)

[ID : aintfloat] 
[ID : bint] 
[ID : cfloat] 
[ID : dintfloatintfloat]

Test case 3: invalid ID #1

Input

internal

Output (tokens and lexer error)

[INT : int] [ID : rnal] 
line 1:3 rule ID failed predicate: {!keyword}?

Test case 4: invalid ID #2

Input

floatation

Output (tokens and lexer error)

[FLOAT : float] [ID : tion] 
line 1:5 rule ID failed predicate: {!keyword}?

Test case 5: non-ID rules

Input

int x
float 3 float 4 float 5
5 a 6 b 7 int 8 d

Output (tokens)

[INT : int] [ID : x] 
[FLOAT : float] [NUMBER : 3] [FLOAT : float] [NUMBER : 4] [FLOAT : float] [NUMBER : 5] 
[NUMBER : 5] [ID : a] [NUMBER : 6] [ID : b] [NUMBER : 7] [INT : int] [NUMBER : 8] [ID : d]
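The decisions rule ID makes for a single run of letters can also be sketched outside ANTLR. This is a plain-Java approximation under stated assumptions, not the generated lexer: KeywordSplitter and its "TYPE:text" output format are made up for illustration, and instead of recovering and continuing the way the lexer does in test cases 3 and 4, it throws at the offset where the lexer reported the failed predicate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class KeywordSplitter {
    // Keywords that may be written back to back (from the grammar above).
    private static final String[] KEYWORDS = {"int", "float"};

    /** Returns the keyword found at pos in word, or null if there is none. */
    private static String keywordAt(String word, int pos) {
        for (String k : KEYWORDS) {
            if (word.startsWith(k, pos)) {
                return k;
            }
        }
        return null;
    }

    /** Splits one run of lowercase letters into "TYPE:text" descriptions. */
    static List<String> split(String word) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < word.length()) {
            String keyword = keywordAt(word, pos);
            if (keyword != null) {
                // A keyword at the current position becomes its own token.
                tokens.add(keyword.toUpperCase(Locale.ROOT) + ":" + keyword);
                pos += keyword.length();
            } else if (tokens.isEmpty()) {
                // The run does not start with a keyword: all of it is one ID.
                tokens.add("ID:" + word);
                pos = word.length();
            } else {
                // Keywords followed by non-keyword text: the {!keyword}? case.
                throw new IllegalArgumentException(
                        "non-keyword text after keyword at offset " + pos);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("intfloat"));  // [INT:int, FLOAT:float]
        System.out.println(split("aintfloat")); // [ID:aintfloat]
    }
}
```

Calling split("internal") throws at offset 3 and split("floatation") at offset 5, the same columns as the lexer errors in test cases 3 and 4.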

score 1 · Accepted Answer

Here is an almost pure-grammar solution for ANTLR 4 (it needs only one small predicate in the target language).

lexer grammar PackedKeywords;

INT : 'int' -> pushMode(Keywords);
FLOAT : 'float' -> pushMode(Keywords);

fragment ID_CHAR : [a-z];
ID_START : ID_CHAR {Character.isLetter(_input.LA(1))}? -> more, pushMode(Identifier);
ID : ID_CHAR;

// these are the other tokens in the grammar
WS : [ \t]+ -> channel(HIDDEN);
Newline : '\r' '\n'? | '\n' -> channel(HIDDEN);

// The Keywords mode duplicates the default mode, except it replaces ID
// with InvalidKeyword. You can handle InvalidKeyword tokens in whatever way
// suits you best.
mode Keywords;

    Keywords_INT : INT -> type(INT);
    Keywords_FLOAT : FLOAT -> type(FLOAT);
    InvalidKeyword : ID_CHAR;
    // must include every token which can follow the Keywords mode
    Keywords_WS : WS -> type(WS), channel(HIDDEN), popMode;
    Keywords_Newline : Newline -> type(Newline), channel(HIDDEN), popMode;

// The Identifier mode is only entered if we know the current token is an
// identifier with >1 characters and which doesn't start with a keyword. This is
// essentially the default mode without keywords.
mode Identifier;

    Identifier_ID : ID_CHAR+ -> type(ID);
    // must include every token which can follow the Identifiers mode
    Identifier_WS : WS -> type(WS), channel(HIDDEN), popMode;
    Identifier_Newline : Newline -> type(Newline), channel(HIDDEN), popMode;

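This grammar also works in the ANTLRWorks 2 lexer interpreter (coming soon!) for everything except single-character identifiers. Because the lexer interpreter cannot evaluate the predicate in ID_START, input such as a<space> will (in the interpreter) produce a single token with the text a<space>, of type WS, on the HIDDEN channel.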
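To make the mode switching easier to follow, here is a rough hand-written simulation of the three modes in plain Java. It is a sketch, not ANTLR-generated code: ModeStackSketch and the "TYPE:text" output format are invented for illustration, it only accepts a-z and whitespace, and it assumes the pushMode commands on INT and FLOAT do not fire when those rules are referenced from the Keywords mode.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

final class ModeStackSketch {
    private enum Mode { DEFAULT, KEYWORDS, IDENTIFIER }

    private static boolean isIdChar(char c) { return c >= 'a' && c <= 'z'; }

    private static boolean isBlank(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Returns "float" or "int" if the input has that keyword at pos, else null. */
    private static String keywordAt(String s, int pos) {
        if (s.startsWith("float", pos)) return "float";
        if (s.startsWith("int", pos)) return "int";
        return null;
    }

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Deque<Mode> modes = new ArrayDeque<>();
        modes.push(Mode.DEFAULT);
        int tokenStart = 0; // start of the token being built, kept across "more"
        int pos = 0;
        while (pos < input.length()) {
            char c = input.charAt(pos);
            if (isBlank(c)) {
                // WS and Newline are hidden; outside the default mode they pop.
                if (modes.peek() != Mode.DEFAULT) modes.pop();
                pos++;
                continue;
            }
            if (!isIdChar(c)) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            String keyword = keywordAt(input, pos);
            switch (modes.peek()) {
                case DEFAULT:
                    if (keyword != null) {
                        // INT / FLOAT -> pushMode(Keywords)
                        tokens.add(keyword.toUpperCase(Locale.ROOT) + ":" + keyword);
                        pos += keyword.length();
                        modes.push(Mode.KEYWORDS);
                    } else if (pos + 1 < input.length() && isIdChar(input.charAt(pos + 1))) {
                        // ID_START -> more, pushMode(Identifier): keep the char, emit nothing yet
                        tokenStart = pos;
                        pos++;
                        modes.push(Mode.IDENTIFIER);
                    } else {
                        // ID: a single-character identifier
                        tokens.add("ID:" + c);
                        pos++;
                    }
                    break;
                case KEYWORDS:
                    if (keyword != null) {
                        // Keywords_INT / Keywords_FLOAT keep the keyword's type
                        tokens.add(keyword.toUpperCase(Locale.ROOT) + ":" + keyword);
                        pos += keyword.length();
                    } else {
                        // InvalidKeyword matches one character at a time
                        tokens.add("InvalidKeyword:" + c);
                        pos++;
                    }
                    break;
                case IDENTIFIER:
                    // Identifier_ID: ID_CHAR+ -> type(ID), including the char kept by "more"
                    while (pos < input.length() && isIdChar(input.charAt(pos))) pos++;
                    tokens.add("ID:" + input.substring(tokenStart, pos));
                    break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("intfloat a")); // [INT:int, FLOAT:float, ID:a]
        System.out.println(tokenize("aintfloat"));  // [ID:aintfloat]
    }
}
```

In this simulation an input such as internal yields INT followed by one InvalidKeyword per remaining letter, which is where the grammar leaves it up to you how to report the problem.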
