2015-07-13 Using Antlr4 to write a grammar
Here are a few tricks I discovered when I implemented my first grammar
using Antlr 4.
First, it really helps to be able to test parts
of your grammar. One option is the
IntelliJ Idea Plugin for ANTLR v4,
a plugin for IntelliJ IDEA (it works
with the community edition, see also
Installing, Updating and Uninstalling Repository Plugins).
You will find many grammar examples at antlr/grammars-v4.
The tool tells you whether the grammar compiles. It also shows when it cannot parse an example,
because it displays a graph of the recognized pieces.
There are three kinds of objects:
- fragment: (syntax: fragment [Name] : [definition] ;), it represents
a reusable piece of a token definition. Fragments do not appear in the result after a text is parsed
because they do not mean anything by themselves.
- token: (syntax: [Name] : [definition] ;), it represents keywords and other
lexical units, and it should avoid ambiguity. For example, if a symbol (+ for example)
can have two meanings (unary, binary), these two meanings should be defined in two rules.
- rule: (syntax: [name] : [definition] ;), it defines the grammar of the language.
Antlr then offers the possibility to walk through the identified rules.
About the syntax:
[Name] means the name must begin with an uppercase letter,
[name] means it must begin with a lowercase letter. All objects are defined with a syntax
very similar to regular expressions.
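The naming convention above is enough to tell the three kinds of objects apart just by looking at a definition. Here is a minimal sketch in Python. The helper kind_of and the three sample definitions are made up for this illustration, they are not part of Antlr.

```python
import re

# Hypothetical helper, not part of Antlr: it only looks at the optional
# "fragment" keyword and at the first letter of the name.
DEFINITION = re.compile(r"^\s*(fragment\s+)?([A-Za-z][A-Za-z0-9_]*)\s*:")


def kind_of(definition):
    """Return 'fragment', 'token' or 'rule' for a single grammar definition."""
    match = DEFINITION.match(definition)
    if match is None:
        raise ValueError("not a definition: %r" % definition)
    if match.group(1):
        return "fragment"
    name = match.group(2)
    # Uppercase first letter: token. Lowercase first letter: rule.
    return "token" if name[0].isupper() else "rule"


# Sample definitions invented for the example.
samples = [
    "fragment DIGIT : [0-9] ;",
    "INT : DIGIT+ ;",
    "expr : INT ('+' INT)* ;",
]
kinds = [kind_of(s) for s in samples]
print(kinds)  # ['fragment', 'token', 'rule']
```

This is only a reading aid: it does not check that the definition itself is valid.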
The grammar DOT (graphviz language)
is a simple example to begin with. We can see that:
- The objects appear from the most complex at the top to the simplest at the end of the grammar.
- The file ends with the rule: WS : [ \t\n\r]+ -> skip ; which means that if a space
could not be matched by a previous rule, it should be skipped. Spaces are
implicitly considered as separators.
- The file begins with grammar DOT; where DOT is the name of the grammar.
- The example Java.g4
contains a main rule compilationUnit: packageDeclaration? importDeclaration* typeDeclaration* EOF ;.
EOF means the end of the file. It forces the parser to read the whole file.
This tells us more about how Antlr tries to match the rules with the text
(not sure I'm right about this):
it tries the rules in the order they are defined in the grammar and
it stops searching whenever it finds a token
or a fragment (ambiguity is not possible for fragments and tokens).
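For tokens, as far as I understand the ANTLR 4 documentation, the lexer keeps the definition that matches the longest piece of text and, when two definitions match the same length, the one that comes first in the grammar. The sketch below imitates that behaviour with Python regular expressions. It is not Antlr's implementation, and the token names and patterns are assumptions made for the example.

```python
import re

# Toy token definitions, in grammar order. They are assumptions made for
# this illustration, written as Python regular expressions.
TOKENS = [
    ("IF", r"if"),
    ("ID", r"[a-z]+"),
    ("INT", r"[0-9]+"),
    ("WS", r"[ \t\r\n]+"),  # plays the role of "-> skip"
]
SKIPPED = {"WS"}


def tokenize(text, tokens=TOKENS):
    """Split text into (name, value) pairs: longest match first,
    earliest definition when two matches have the same length."""
    position, result = 0, []
    while position < len(text):
        best_name, best_length = None, 0
        for name, pattern in tokens:
            match = re.compile(pattern).match(text, position)
            # Strict '>' keeps the earliest definition on equal lengths.
            if match and match.end() - position > best_length:
                best_name, best_length = name, match.end() - position
        if best_name is None:
            raise ValueError("cannot match text at position %d" % position)
        if best_name not in SKIPPED:
            result.append((best_name, text[position:position + best_length]))
        position += best_length
    return result


print(tokenize("if iffy 42"))
# [('IF', 'if'), ('ID', 'iffy'), ('INT', '42')]
```

With ID listed before IF, the text if would come out as an ID, because both definitions match two characters and the first one wins.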
I made many mistakes when building my first grammar. One of them looked like this:
line 1:0 mismatched input 'aa' expecting {'something', 'aa'}
As you can see, 'aa' was expected but not matched.
It was usually due to some ambiguity. To detect the conflict
with the plugin mentioned above,
I checked the rule the string was supposed to match (it failed), then
I removed all the rules above it, and it usually worked. By adding them back one by one
to the grammar, it became easier to understand where the conflict was.
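I did this by hand in the plugin, but the procedure is just a loop. Here is a minimal sketch of it in Python. The check aa_is_keyword is a toy stand-in for what the plugin shows, and the definitions INT, ID, STR and AA are hypothetical, written as Python regular expressions rather than real Antlr syntax.

```python
import re


def first_conflict(rules_above, target, still_works):
    """Add the removed rules back one by one, in grammar order, and return
    the name of the first one that makes the check fail (None if none does)."""
    kept = []
    for rule in rules_above:
        kept.append(rule)
        if not still_works(kept + [target]):
            return rule[0]
    return None


def aa_is_keyword(rules):
    """Toy check standing in for the plugin: the text 'aa' must be claimed
    by the definition named AA (the first full match wins)."""
    for name, pattern in rules:
        if re.fullmatch(pattern, "aa"):
            return name == "AA"
    return False


# Hypothetical definitions, written as Python regular expressions.
above = [("INT", r"[0-9]+"), ("ID", r"[a-z]+"), ("STR", r'"[^"]*"')]
target = ("AA", r"aa")

culprit = first_conflict(above, target, aa_is_keyword)
print(culprit)  # ID
```

Here ID is reported because it already matches 'aa' before the definition AA gets a chance.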
For Python, I added functions to the module
pyensae to build and use Antlr4 grammars.
See antlr_grammar_build.py and
antlr_grammar_use.py.