ANTLR for Ruby

Lexers

updated Sunday, August 04, 2013 at 09:59PM EDT

A lexer’s job is to take a stream of text and chop it up into a sequence of tokens — little strings categorized with type codes like NUMBER or KEYWORD. For more information on tokens and their characteristics, read Tokens.

For a general overview of lexers, check out:

How Lexers Differ from Parsers

I do not intend to cover many details of lexical analysis in this article. However, highlighting a few of the key differences between lexers and parsers should help clarify the documentation below for users who are new to ANTLR.

As described above, lexers chop raw language text into tokens. If a particular character sequence in the input cannot be matched against the lexical rules of the grammar, the lexer must handle the syntax error somehow and optionally report the error.¹

A parser’s task is then to take the stream of tokens extracted by the lexer and recognize any one of a number of larger syntactic statements. Parsers may have more than one point of entry. For example, a python parser will probably have one entry rule for parsing full python scripts. It will probably have another rule for parsing a single python statement read from an interactive prompt. Since python restricts the content permitted in its eval() function to expressions, the parser may have an additional entry rule for strings passed to eval().

In contrast, lexers typically have a single point of entry. Using lexer rules, a grammar defines the various tokens that may occur in a language. ANTLR will derive an additional lexer rule that combines all of the lexical rules, invoking the most appropriate token rule based upon the next few characters in the input stream. Thus, while parsers can have any number of independent entry points, lexers automatically choose the most appropriate rule for the current input.

Lexer Code and Class Structure

For a combined or lexer grammar named Language, antlr4ruby will generate a lexer class. In other ANTLR targets, the generated class is named LanguageLexer. This ruby implementation, however, generates a top-level module named Language, which serves as a general namespace for the various entities required by the code. The actual lexer class is simply named Lexer, so the full name of the generated lexer will be Language::Lexer. Consider this really simple lexer grammar:

Digit.g
lexer grammar Digit;

options {
  language = Ruby;
}

DIGIT: '0' .. '9';

An abbreviated form of the output generated by antlr4ruby Digit.g is shown below:

Digit.rb (abridged)
# edited out code that ensures the antlr3 run-time library is required

module Digit
  # TokenData defines all of the token type integer values
  # as constants, which will be included in all
  # ANTLR-generated recognizers.
  const_defined?(:TokenData) or TokenData = ANTLR3::TokenScheme.new

  module TokenData

    # define the token constants
    define_tokens( :DIGIT => 4, :EOF => -1 )

  end


  class Lexer < ANTLR3::Lexer
    @grammar_home = Digit
    include TokenData

    begin
      generated_using( "Digit.g", "3.2.1-SNAPSHOT Dec 18, 2009 04:29:28", "1.6.3" )
    rescue NoMethodError => error
      error.name.to_sym == :generated_using or raise
    end

    RULE_NAMES   = ["DIGIT"].freeze
    RULE_METHODS = [:digit!].freeze

    def initialize(input=nil, options = {})
      super(input, options)
    end

    # - - - - - - - - - - - lexer rules - - - - - - - - - - - -
    # lexer rule digit! (DIGIT)
    # (in /home/kyle/lib/ruby/projects/antlr3/test/functional/lexer/basic/Digit.g)
    def digit!
      # edited out recognition logic
    end

    # main rule used to study the input at the current position,
    # and choose the proper lexer rule to call in order to
    # fetch the next token

    # usually, you don't make direct calls to this method,
    # but instead use the next_token method, which will
    # build and emit the actual next token
    def token!
      # at line 1:10: DIGIT
      digit!
    end
  end # class Lexer < ANTLR3::Lexer
end

Thus, the generated code for a lexer creates the following named entities, demonstrated in the short session after the list:

  1. module Language – where Language is the name of the input grammar
  2. class Language::Lexer < ANTLR3::Lexer – the lexer implementation
  3. module Language::TokenData – an ANTLR3::TokenScheme (subclass of Module), which is used to define token types and a token class
  4. class Language::TokenData::Token < ANTLR3::CommonToken – not apparent in the code above, this class is dynamically created along with Language::TokenData
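These entities can be verified in an irb session once the generated Digit.rb has been loaded. The inspection below is a sketch based on the structure shown above:

require 'Digit'

Digit.class                          # => Module
Digit::Lexer.superclass              # => ANTLR3::Lexer
Digit::TokenData.class               # => ANTLR3::TokenScheme
Digit::TokenData::Token.superclass   # => ANTLR3::CommonToken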

Instantiating Lexers

Providing an Input Stream

A lexer must be provided with a text source to tokenize. More specifically, a lexer takes in a stream object, which feeds it characters from a text source, and produces a series of token objects. For example, below is an illustration of creating an instance of Digit::Lexer with a stream object:

input = ANTLR3::StringStream.new( "123" )
lexer = Digit::Lexer.new( input )

input = ANTLR3::FileStream.new( 'numbers.txt' )
lexer = Digit::Lexer.new( input )

Providing Strings or Files Directly

More often than not, the stream a lexer will operate upon will be an instance of ANTLR3::StringStream or ANTLR3::FileStream. Thus, the runtime library for this binding provides convenience casting for the most common cases. The initialize method of a lexer will automatically cast plain strings and file objects into appropriate stream objects.

lexer = Digit::Lexer.new( "123" )

lexer =
  open( 'numbers.txt' ) do | f |
    Digit::Lexer.new( f )
  end

Fetching Tokens

require 'Digit'

lexer = Digit::Lexer.new( "123" )

lexer.next_token
# => DIGIT["1"] @ line 1 col 0 (0..0)

lexer.exhaust
# => [DIGIT["2"] @ line 1 col 1 (1..1), DIGIT["3"] @ line 1 col 2 (2..2)]

lexer.next_token
# => <EOF>

lexer.reset
lexer.map { | tk | tk.text.to_i }
# => [1, 2, 3]

Preparing a Lexer for Parser Input

lexer = Digit::Lexer.new( "123" )
tokens = ANTLR3::CommonTokenStream.new( lexer )
# `tokens' can be passed to the constructor of an ANTLR parser for parsing
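With a combined grammar, the same pattern extends naturally into parsing. The snippet below is a sketch only; Language::Parser and its entry rule script are hypothetical names standing in for whatever your grammar actually defines:

lexer  = Language::Lexer.new( "123" )
tokens = ANTLR3::CommonTokenStream.new( lexer )
parser = Language::Parser.new( tokens )  # hypothetical generated parser
parser.script  # invoke a (hypothetical) parser entry rule named `script'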

Rule to Method Name Mapping

In most usage scenarios, a developer does not need to invoke lexer rule methods directly; most lexer work is done by invoking Lexer#next_token. However, it is important to illustrate how a grammar’s identifiers are used in the output ruby code. This is an area in which the antlr3 package diverges from other ANTLR target conventions.

All ANTLR grammar rules are translated into method definitions. For parsers and tree parsers, the methods share the same name as the rule. ANTLR requires named lexer rules to start with a capital letter. To avoid name conflicts with constants and to support style conventions in ruby, lexer rule names are altered when they become method names in lexer classes.

Consider the following simple grammar:

lexer grammar Variables;
options { language = Ruby; }

ID: ( 'a' .. 'z' | '_' ) ( 'a' .. 'z' | 'A' .. 'Z' | '_' | '0' .. '9' )*;
WHOLE_NUMBER: ( '0' .. '9' )+;
SPACE: ( ' ' | '\t' | '\n' | '\r' )+;

Below is an abridged skeleton of the code generated by antlr4ruby:

module Variables
#...
class Lexer < ANTLR3::Lexer
  #...
  def id!
    # recognition logic for ID
  end

  def whole_number!
    # recognition logic for WHOLE_NUMBER
  end

  def space!
    # recognition logic for SPACE
  end

  def token!
    # the auto-generated rule used to pick the lexer rule
  end
end

end

The method naming convention for lexer rule methods works as follows (a short ruby sketch of the conversion appears after the list):

  1. reformat the name to use a lower case convention:
    1. ALL_CAPS_AND_UNDERSCORES names become all_caps_and_underscores
    2. CamelCase names become camel_case names, as in Rails
  2. append an exclamation point
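The whole conversion fits in a couple of lines of ruby. The sketch below is purely illustrative; rule_method_name is not a method provided by the antlr3 runtime:

# convert a lexer rule name into its generated method name:
# underscore any CamelCase boundaries, downcase, and append `!'
def rule_method_name( rule_name )
  rule_name.gsub( /([a-z\d])([A-Z])/, '\1_\2' ).downcase << '!'
end

rule_method_name( "WHOLE_NUMBER" )  # => "whole_number!"
rule_method_name( "CamelCase" )     # => "camel_case!"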

Rationale For the Lexer Rule Method Naming Convention

Clearly there’s a bit of opinionated design in this naming convention. Ruby does allow method names to begin with a capital letter, as in Kernel#Array or Kernel#String. So why don’t lexer rule methods simply share the rule’s name? Well,

  1. methods that begin with capital letters cannot be called privately without arguments or parentheses (see the example following this list);
  2. token types are implemented as integer constants sharing the lexer rule name.
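The first point is easy to demonstrate in plain ruby (the method name Digit here is purely for illustration, not generated code):

# an illustrative capitalized method, defined at the top level
def Digit
  "matched a digit"
end

Digit    # NameError: ruby parses the bare capitalized name as a constant
Digit()  # => "matched a digit" -- the parentheses force a method call

And since ANTLR defines token types as constants (DIGIT = 4 and so on), a rule method sharing the token’s name would be shadowed by the constant whenever it was referenced without parentheses.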

Thus, while reading the lexer code, it’d be easy to confuse a reference to token type X with a call to rule method X(). All generated calls to method X would require extra parentheses, making the code a little messier. Furthermore, while ruby does not impose a specific stylistic convention, the community generally adopts snake_case for method names.

The way other ANTLR language targets handle the name conflict is to prefix lexer rule methods with an extra lowercase letter, so that rule ID becomes method mID(). As a developer, I’m not a fan of name mangling like that, so I chose to diverge from ANTLR conventions.

So why the exclamation point?

  1. It prevents lexer rule methods from conflicting with ruby keywords. A lexer rule named FOR or DEF will be defined with method name for! or def!, avoiding potential syntax errors.
  2. It allows lexer rule methods to be referenced privately without parentheses, making code a little cleaner. While id may refer to a local variable or a method call, id! always refers to a method call. Both points are demonstrated below.
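Here is a short snippet showing both benefits (Example, for!, id!, and demo are illustrative names only, not antlr3 API):

class Example
  # a method named plain `for' could only be invoked through tricks like
  # send( :for ), since a bare `for' is parsed as the ruby keyword;
  # `for!' sidesteps the clash entirely
  def for!
    "matched a FOR token"
  end

  def id!
    "matched an ID token"
  end

  def demo
    id = "a local variable"
    [ id, id! ]  # `id' picks up the variable; `id!' is always the method
  end
end

Example.new.demo  # => ["a local variable", "matched an ID token"]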

Important Note About Lexer Rule Names

This naming convention for lexer rules could potentially cause a significant bug in the generated code in one particular situation. Consider this grammar:

lexer grammar Strings;

FLOAT:   Integer '.' Integer;
INTEGER: Integer ( 'l' | 'L' )?;

fragment Integer: ( '0' .. '9' )+;

The method code is going to look something like:

  def float!
    # ... logic for FLOAT
  end

  def integer!
    # ... logic for INTEGER
  end

  def integer!
    # ... logic for Integer
  end

Thus the fragment rule Integer’s method will overwrite the rule method for INTEGER. I have seen this sort of naming used in real grammars, and, as a result of the lexer rule naming convention, it will break the generated lexer code. This shouldn’t be a common issue, as it isn’t particularly sensible for lexer rules INTEGER and Integer to coexist in the same grammar (Integer would make more sense as DIGITS in this example).
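In plain ruby terms, the collision is just ordinary method redefinition, which happens silently (Demo is an illustrative class, not generated code):

class Demo
  def integer!  # generated for INTEGER
    "INTEGER"
  end

  def integer!  # generated for fragment Integer -- quietly replaces the first
    "Integer fragment"
  end
end

Demo.new.integer!  # => "Integer fragment"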

Currently, antlr4ruby does not detect this problem, issue a warning, or do anything to differentiate the names. While I may change the naming convention or find a way to have ANTLR issue a warning in the future, be aware that lexer rule names are effectively case-insensitive as far as code generation is concerned.

Lexer Operation and Rule Methods

Essentially, a lexer’s token construction process functions like this (a condensed sketch in code follows the list):

  1. The user calls Lexer#next_token, defined in super-class ANTLR3::Lexer
    • #next_token clears any of the lexer’s internal state information that may remain from prior calls
    • it sets up a rescue block to catch any recognition errors
    • it then calls the special lexer rule method, Lexer#token!, which is automatically generated by ANTLR
  2. Lexer#token! examines the next few characters in the input stream and attempts to choose the best fit from the lexer rules defined in the grammar.
    • if a rule is selected, #token! calls the method corresponding to the rule
    • if the current input does not appear to fit any rule, it raises a NoViableAlternative recognition error.
  3. The chosen lexer rule method runs the full course of its recognition logic, verifying that the current series of characters conforms to the token type specification.
    • during this process, the lexer populates its internal state record, recording the token type, the range of text matched, the token channel, the line/column position, any token properties set in action blocks by the developer, and the effects of any calls to Lexer#emit or Lexer#skip.
    • if the input doesn’t fit the type specifications of the rule, the rule method will raise a recognition error of some sort
    • the rule itself does not create a token nor does it return any significant value — just nil
  4. control returns to Lexer#next_token (from step 1)
    • if a recognition error was raised, control will turn over to error recovery and reporting methods
    • otherwise, if a token was successfully matched, #next_token will build a new token using the settings collected in its internal state and return the new token object
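Condensed into code, the cycle looks roughly like the sketch below. It paraphrases the control flow described above; it is not the actual ANTLR3::Lexer source, and reset_state, build_token, and handle_error are stand-ins for the runtime’s internal machinery:

# a paraphrase of the next_token cycle -- not actual runtime source
def next_token
  loop do
    reset_state              # 1. clear leftover state from prior calls
    begin
      token!                 # 2-3. choose and run the matching rule method
      return build_token     # 4. assemble a token from the recorded state
    rescue RecognitionError => error
      handle_error( error )  # 4. report the error and attempt recovery
    end
  end
end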

Some additional notes on the tokenization process and lexer operation:

Fragment Rules

Your grammar may specify additional lexer rules that function purely as helpers for the main lexer rules. These rules are prefixed with the keyword fragment in the grammar file. They are not integrated into the automatically-generated #token! rule method, and they are not associated with a token type constant. Since they are not part of the lexer’s main tokenization procedure and exist mostly to be referenced by other rules, they are allowed to have parameter and return value specifications. Otherwise, they follow the same naming conventions and structure as normal lexer rule methods. Here’s an example of a lexer grammar using fragment rules:

Numbers.g : fragment lexer rule demonstration
lexer grammar Numbers;

tokens {
  COMPLEX;
}

FLOAT
  : ( '.' DIGITS EXP?
    |  DIGITS '.'? EXP
    |  DIGITS '.' ( DIGITS EXP? )?
    )
    ( IMAG { $type = COMPLEX } )?
  ;

INT
  : ( '0' ( 'x' | 'X' ) HEX_DIGITS
    | DIGITS
    ) ( IMAG { $type = COMPLEX } )?
  ;

fragment
DIGITS
  : ( '0' .. '9' )+
  ;

fragment
HEX_DIGITS
  : ( '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' )+
  ;

fragment
IMAG
  : ( 'i' | 'I' | 'j' | 'J' )
  ;

fragment
EXP
  : ( 'e' | 'E' ) ( '+' | '-' )? DIGITS
  ;

This grammar will only produce tokens of type INT, FLOAT, or COMPLEX, as the four additional rules are fragment rules.
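For example, feeding the lexer an imaginary literal shows the $type reassignment in action. The session below is a sketch, assuming Numbers.g has been compiled with antlr4ruby and the output loaded:

lexer = Numbers::Lexer.new( "3.5j" )
token = lexer.next_token
# the FLOAT rule matched the text, but the embedded action reassigned
# $type, so the token comes back typed as COMPLEX rather than FLOAT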

¹ I plan on covering error handling in this guide in the future. For the time being, however, the RDoc documentation for ANTLR3::Recognizer and ANTLR3::Lexer should provide insight into ANTLR’s error-handling mechanisms.