ANTLR for Ruby

Parsers

updated Sunday, August 04, 2013 at 09:59PM EDT

Parser Code and Class Structure

For a combined or pure parser grammar named Language, antlr4ruby generates a parser class. In other ANTLR targets, the generated class is named LanguageParser. The Ruby target instead generates a top-level module named Language, which serves as a namespace for the various entities the generated code requires. The parser class itself is simply named Parser, so the full name of the generated parser is Language::Parser. Consider this combined grammar:

grammar AddingMachine;

options {
  language = Ruby;
}

expression returns[ value ]
  : a=NUMBER '+' b=NUMBER { $value = $a.text.to_i + $b.text.to_i }
  | a=NUMBER '-' b=NUMBER { $value = $a.text.to_i - $b.text.to_i }
  ;

NUMBER: ( '0' .. '9' )+;
SPACE: ' '+ { $channel = HIDDEN };

An abbreviated form of the output generated by antlr4ruby AddingMachine.g is shown below:

# edited out run-time library require procedure

module AddingMachine
  # TokenData defines all of the token type integer values
  # as constants, which will be included in all
  # ANTLR-generated recognizers.
  const_defined?(:TokenData) or TokenData = ANTLR3::TokenScheme.new

  module TokenData
    # define the token constants
    define_tokens( :NUMBER => 4, :EOF => -1, :SPACE => 5, :T__7 => 7, :T__6 => 6 )

    # register the proper human-readable name or literal value
    # for each token type
    #
    # this is necessary because anonymous tokens, which are
    # created from literal values in the grammar, do not
    # have descriptive names
    register_names( "NUMBER", "SPACE", "'+'", "'-'" )
  end


  class Parser < ANTLR3::Parser
    @grammar_home = AddingMachine

    RULE_METHODS = [ :expression ].freeze

    include TokenData

    generated_using( "./AddingMachine.g", "3.2.1-SNAPSHOT Dec 18, 2009 04:29:28", "1.6.3" )

    def initialize( input, options = {} )
      super( input, options )
    end

    # - - - - - - - - - - - - Rules - - - - - - - - - - - - -

    # parser rule expression

    # (in ./AddingMachine.g)
    # 7:1: expression returns [ value ] : (a= NUMBER '+' b= NUMBER | a= NUMBER '-' b= NUMBER );

    def expression
      # edited 50+ lines of recognition logic
    end

    # edited out various other support code
  end # class Parser < ANTLR3::Parser
end

Thus, the generated code for a parser creates the following named entities:

  1. module Language – where Language is the name of the input grammar
  2. class Language::Parser < ANTLR3::Parser – the parser implementation
  3. module Language::TokenData – an ANTLR3::TokenScheme (subclass of Module), which is used to define token types and a token class
  4. class Language::TokenData::Token < ANTLR3::CommonToken – not apparent in the code above, this class is dynamically created along with Language::TokenData
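
Item 3 is easier to picture with a small example. Because a token scheme is an instance of a Module subclass, it can receive method calls like any object and also be mixed into a class with include. The sketch below uses a made-up stand-in class, SketchTokenScheme, rather than the real ANTLR3::TokenScheme, whose internals are not shown on this page. It only illustrates how constants defined on such a module become visible inside an including class.

```ruby
# Hypothetical stand-in for illustration only; this is NOT the ANTLR3 runtime class.
class SketchTokenScheme < Module
  # Define one integer constant per token name on this module instance.
  def define_tokens( types )
    types.each { | name, value | const_set( name, value ) }
    self
  end
end

module SketchLanguage
  # The scheme is a module *instance*, stored in a constant.
  TokenData = SketchTokenScheme.new
  TokenData.define_tokens( :NUMBER => 4, :SPACE => 5 )

  class Parser
    # Mixing the scheme in makes NUMBER and SPACE resolvable here.
    include TokenData

    def number_type
      NUMBER
    end
  end
end

puts SketchLanguage::Parser.new.number_type
puts SketchLanguage::Parser::SPACE
```

The token type values 4 and 5 are taken from the generated output above; everything else in the sketch is an assumption made for illustration.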

Instantiating Parsers

Providing a Token Stream

A parser must be provided with a stream of tokens to recognize.

lexer = AddingMachine::Lexer.new( "1 + 1" )
tokens = ANTLR3::CommonTokenStream.new( lexer )
parser = AddingMachine::Parser.new( tokens )

Providing a Lexer or ANTLR3::TokenSource Object

lexer = AddingMachine::Lexer.new( "1 + 1" )
parser = AddingMachine::Parser.new( lexer )

Providing an Input String or File

If the parser class can determine which lexer class to use to tokenize the input, the multi-step instantiation process shown above reduces to a single step, as demonstrated below.

parser = AddingMachine::Parser.new( "1 + 1" )

parser =
  open( 'sums.txt' ) { | f | AddingMachine::Parser.new( f ) }

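This can only happen in these circumstances:

  1. the parser is generated from a combined grammar, so a matching Lexer class is generated in the same module
  2. the parser class has been told explicitly which lexer class to use, by setting @associated_lexer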

Since AddingMachine.g is a combined grammar, the sample code above works without any extra effort. To demonstrate the second scenario, consider rewriting AddingMachine as a pure parser grammar. The example is somewhat contrived, but suppose you want two different lexers: one for decimal numbers and one for hexadecimal numbers. You would write two lexer grammars:

lexer grammar Decimal;

options { language = Ruby; }

@token::members {
  def value
    return text.to_i
  end
}

NUMBER: ( '0' .. '9' )+;
PLUS: '+';
MINUS: '-';
SPACE: ' '+ { $channel = HIDDEN };

lexer grammar Hexadecimal;

options { language = Ruby; }

@token::members {
  def value
    return text.to_i( 16 )
  end
}

NUMBER: ( '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' )+;
PLUS: '+';
MINUS: '-';
SPACE: ' '+ { $channel = HIDDEN };
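
The two value definitions differ only in the radix passed to String#to_i. The following plain-Ruby check, independent of ANTLR and using sample strings chosen only for illustration, shows why the hexadecimal lexer needs its own definition:

```ruby
decimal_text = "100"
hex_text     = "FF"

decimal_value = decimal_text.to_i      # base 10 is the default radix
hex_value     = hex_text.to_i( 16 )    # parse the same kind of text as base 16
wrong_value   = hex_text.to_i          # base 10 finds no leading digits in "FF"

puts decimal_value   # 100
puts hex_value       # 255
puts wrong_value     # 0
```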

Now suppose the most common usage scenario for AddingMachine involves decimal numbers. To keep automatic input tokenization, you would write AddingMachine like this:

parser grammar AddingMachine;

options { language = Ruby; }

@members {
  require 'Decimal'
  @associated_lexer = Decimal::Lexer
}

expression returns[ value ]
  : a=NUMBER PLUS  b=NUMBER { $value = $a.value + $b.value }
  | a=NUMBER MINUS b=NUMBER { $value = $a.value - $b.value }
  ;

After generating code for all three grammars, the following code will work correctly:

require 'AddingMachine'

AddingMachine::Parser.new( "100 - 10" ).expression    # => 90

require 'Hexadecimal'
lexer = Hexadecimal::Lexer.new( "FF - 01" )
AddingMachine::Parser.new( lexer ).expression         # => 254
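
As a rough mental model of the lexer lookup described in this section, the sketch below uses hypothetical stand-in classes (SketchParser, Combined, DecimalSketch, PureParser, Orphan) rather than the ANTLR3 runtime, whose actual implementation is not shown here. It assumes that an explicitly assigned @associated_lexer takes priority and that the parser otherwise looks for a Lexer class in its grammar's module.

```ruby
# Hypothetical stand-ins; the names and the lookup order are assumptions.
class SketchParser
  class << self
    attr_reader :grammar_home

    # An explicit assignment wins; otherwise look for GrammarHome::Lexer.
    def associated_lexer
      return @associated_lexer if @associated_lexer
      home = grammar_home
      home.const_get( :Lexer, false ) if home && home.const_defined?( :Lexer, false )
    end
  end
end

# Combined grammar: lexer and parser live in the same module.
module Combined
  class Lexer; end
  class Parser < SketchParser
    @grammar_home = Combined
  end
end

module DecimalSketch
  class Lexer; end
end

# Pure parser grammar that names its lexer explicitly.
module PureParser
  class Parser < SketchParser
    @grammar_home = PureParser
    @associated_lexer = DecimalSketch::Lexer
  end
end

# Pure parser grammar with no lexer information at all.
module Orphan
  class Parser < SketchParser
    @grammar_home = Orphan
  end
end

p Combined::Parser.associated_lexer
p PureParser::Parser.associated_lexer
p Orphan::Parser.associated_lexer
```

In the last case no lexer can be found, which corresponds to the situation where a token stream or lexer object has to be passed to the parser by hand.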

Parser Rules

Each parser rule is implemented as a method of the parser class. Thus, rule statement in grammar Language.g will be implemented as method Language::Parser#statement.

Rule Method Arguments

ANTLR does allow rules to be specified with arguments. For example:

grammar Args;

statement[ include_loops ]
  : { include_loops }? ( loop | conditional )
  | conditional
  ;

will produce code that contains

def statement( include_loops )
  # a whole bunch of recognition code ...
end

Rule argument specification is more limited than Ruby’s argument specification syntax. Unfortunately, this is beyond my control as a target developer. Rule argument specification is mostly controlled by ANTLR’s core syntax and the ANTLR tool’s semantics, which are somewhat skewed in favor of Java’s semantics.

Rule Method Visibility

By default, all rule methods are public. However, ANTLR permits specifying rule visibility with modifiers public, protected, or private. The generated source code will honor these modifiers, setting the rule method’s visibility.

grammar Whatevs;

public a      : ID '=' b ;
protected b   : c | ID ;
private c     : NUM | STR ;
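
In Ruby terms, the effect of these modifiers can be pictured with an ordinary class. The class below is a hand-written stand-in, not generated output, and SketchWhatevs is a made-up name. Its methods carry the same visibility as rules a, b, and c, with placeholder bodies.

```ruby
# Hand-written stand-in; real generated rule methods contain recognition logic.
class SketchWhatevs
  def a
    :a
  end

  protected

  def b
    :b
  end

  private

  def c
    :c
  end
end

parser = SketchWhatevs.new

# Only the public rule method can be called from outside the object.
puts parser.a

# Visibility can be inspected through the class.
puts SketchWhatevs.public_method_defined?( :a )
puts SketchWhatevs.protected_method_defined?( :b )
puts SketchWhatevs.private_method_defined?( :c )

# Calling a non-public rule method from outside raises NoMethodError.
begin
  parser.c
rescue NoMethodError => error
  puts error.class
end
```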