Using Antlr4 to write a grammar

antlr4

2015-07-13 Using Antlr4 to write a grammar

Here are a few tricks I discovered while implementing my first grammar with Antlr 4. First, it really helps to be able to test parts of your grammar. One option is the ANTLR v4 plugin for IntelliJ IDEA (the plugin works with the Community Edition, see also Installing, Updating and Uninstalling Repository Plugins). You will find many grammar examples at antlr/grammars-v4. The plugin tells you whether the grammar compiles, and when it cannot parse an example, it displays a graph of the pieces it did recognize.

There are three kinds of objects:

fragment: (syntax: fragment [Name] : [definition] ;), a reusable piece of a token definition. Fragments do not appear in the result after a text is parsed because they do not mean anything by themselves.
token: (syntax: [Name] : [definition] ;), represents keywords, operators and other lexical units. Token definitions should avoid ambiguity. For example, if a symbol (+ for example) can have two meanings (unary, binary), it should remain a single token and the two meanings should be defined in two rules.
rule: (syntax: [name] : [definition] ;), defines the grammar of the language in terms of tokens and other rules. Antlr then offers the possibility to walk through the rules it identified in the parsed text.

About the syntax: [Name] means the name must begin with an uppercase letter, and [name] means it must begin with a lowercase letter. All objects are defined with a syntax very similar to regular expressions. The DOT grammar (the Graphviz language) is a simple example to begin with. We can see that:

The objects appear from the most complex at the top to the simplest at the end of the grammar.
The file ends with the rule: WS : [ \t\n\r]+ -> skip ; which means that whitespace not matched by a previous rule is skipped. Whitespace therefore only acts as a separator between tokens.
The file begins with grammar DOT; where DOT is the name of the grammar.
The example Java.g4 contains a main rule compilationUnit : packageDeclaration? importDeclaration* typeDeclaration* EOF ;. EOF means the end of file. It forces the parser to consume the whole file instead of stopping as soon as the rule is satisfied.
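To make the three kinds of objects more concrete, here is a rough analogy in plain Python rather than Antlr itself. It is only a sketch: the names DIGIT, LETTER, TOKEN_SPEC, tokenize and parse_sum are made up for the illustration, and Python's regular expressions pick the first alternative that matches, which only approximates what an Antlr lexer does. The equivalent grammar is given in the comments.

```python
import re

# Rough Antlr equivalent of this sketch (hypothetical grammar):
#   sum     : operand (PLUS operand)* EOF ;
#   operand : NUMBER | ID ;
#   NUMBER  : DIGIT+ ;
#   ID      : LETTER (LETTER | DIGIT)* ;
#   PLUS    : '+' ;
#   WS      : [ \t\r\n]+ -> skip ;
#   fragment DIGIT  : [0-9] ;
#   fragment LETTER : [A-Za-z_] ;

# fragment: a reusable piece of pattern, never emitted as a token by itself.
DIGIT = r"[0-9]"
LETTER = r"[A-Za-z_]"

# token: named patterns built from fragments.
TOKEN_SPEC = [
    ("NUMBER", DIGIT + "+"),
    ("ID", LETTER + "(?:" + LETTER + "|" + DIGIT + ")*"),
    ("PLUS", r"\+"),
    ("WS", r"[ \t\r\n]+"),  # dropped in tokenize(), like `-> skip`
]
MASTER = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(text):
    """Split text into (token name, lexeme) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError("unexpected character %r at %d" % (text[pos], pos))
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    tokens.append(("EOF", ""))  # explicit end-of-file marker
    return tokens


def parse_sum(tokens):
    """rule: works on tokens only, and requires EOF at the end."""
    def operand(i):
        if tokens[i][0] not in ("NUMBER", "ID"):
            raise SyntaxError("expected NUMBER or ID, got %r" % (tokens[i],))
        return tokens[i][1], i + 1

    first, i = operand(0)
    operands = [first]
    while tokens[i][0] == "PLUS":
        value, i = operand(i + 1)
        operands.append(value)
    if tokens[i][0] != "EOF":  # without this check, trailing input would be ignored
        raise SyntaxError("trailing input at token %d" % i)
    return operands


print(parse_sum(tokenize("x1 + 42 + y")))  # ['x1', '42', 'y']
```

The fragments never show up in the token list, whitespace disappears before the rule sees anything, and an input such as "1 2" is rejected only because the rule insists on EOF.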

This tells us more about how Antlr matches the text (as far as I understand it): the text is first split into tokens, and when several token definitions could match at the same position, Antlr keeps the longest match and, if several have the same length, the definition that appears first in the grammar. Fragments are never matched on their own. Rules are then applied to the resulting tokens. I made many mistakes when building my first grammar. One of them looked like this:

line 1:0 mismatched input 'aa' expecting {'something', 'aa'}

As you can see, 'aa' was expected but not matched. This was usually due to some ambiguity: typically, the text had been recognized as a different token than the one the rule expected. To detect the conflict with the plugin mentioned above, I tested the rule the string was supposed to match (it failed), then removed all the rules above it, after which it usually worked. Adding them back one by one made it easier to understand where the conflict was.
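This kind of conflict can be reproduced with a toy simulation in Python. It is not Antlr, only a sketch that assumes the tie-breaking described in the Antlr 4 documentation: the longest match wins, and among matches of the same length the token defined first wins. The token names ID and AA and the helper first_token are hypothetical.

```python
import re


def first_token(rules, text):
    """Return (name, lexeme) for the first token of `text`.

    `rules` is a list of (name, regex) pairs in definition order.
    The longest match wins; on a tie, the rule listed first wins.
    """
    best = None
    for name, pattern in rules:
        m = re.match(pattern, text)
        # Replace only on a strictly longer match, so earlier rules win ties.
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best


# Hypothetical token definitions, in the order they would appear in a grammar.
conflicting = [("ID", r"[a-z]+"), ("AA", r"aa")]  # AA can never win on input 'aa'
reordered = [("AA", r"aa"), ("ID", r"[a-z]+")]    # AA now wins the tie

print(first_token(conflicting, "aa"))  # ('ID', 'aa')
print(first_token(reordered, "aa"))    # ('AA', 'aa')
print(first_token(reordered, "aab"))   # ('ID', 'aab')
```

With the first ordering, a rule that expects the token AA receives an ID whose text is 'aa', which gives an error of the shape shown above. Moving the more specific token before the general one removes the conflict, while longer inputs such as 'aab' are still recognized as ID.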

In Python, I added functions to the module pyensae to build and use an Antlr 4 grammar. See antlr_grammar_build.py and antlr_grammar_use.py.

Xavier Dupré
