Class: EBNF::LL1::Lexer
- Inherits: Object
- Includes: Unescape, Enumerable
- Defined in: lib/ebnf/ll1/lexer.rb
Overview
A lexical analyzer
Defined Under Namespace
Classes: Error, Terminal, Token
Constant Summary
Constants included from Unescape
Unescape::ECHAR, Unescape::ESCAPE_CHAR4, Unescape::ESCAPE_CHAR8, Unescape::ESCAPE_CHARS, Unescape::UCHAR
Instance Attribute Summary
- #input ⇒ String
  The current input string being processed.
- #options ⇒ Hash (readonly)
  Any additional options for the lexer.
- #scanner ⇒ StringScanner (readonly, protected)
- #whitespace ⇒ Regexp (readonly)
  Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.
Class Method Summary
- .tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer
  Tokenizes the given input string or stream.
- .unescape_codepoints(string) ⇒ String
  Returns a copy of the given input string with all \uXXXX and \UXXXXXXXX Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
- .unescape_string(input) ⇒ String
  Returns a copy of the given input string with all string escape sequences (e.g. \n and \t) replaced with their unescaped UTF-8 character counterparts.
Instance Method Summary
- #each_token {|token| ... } ⇒ Enumerator (also: #each)
  Enumerates each token in the input string.
- #first(*types) ⇒ Token
  Returns the first token in the input stream.
- #initialize(input = nil, terminals = nil, **options) ⇒ Lexer (constructor)
  Initializes a new lexer instance.
- #lineno ⇒ Integer
  The current line number (one-based).
- #match_token(*types) ⇒ Token (protected)
  Returns the matched token.
- #recover(*types) ⇒ Token
  Skips input until a token is matched.
- #shift ⇒ Token
  Returns the first token and shifts to the next.
- #skip_whitespace ⇒ Object (protected)
  Skips whitespace, as defined through input options or defaults.
- #token(type, value, **options) ⇒ Token (protected)
  Constructs a new token object annotated with the current line number.
- #valid? ⇒ Boolean
  Returns true if the input string is lexically valid.
Methods included from Unescape
Constructor Details
#initialize(input = nil, terminals = nil, **options) ⇒ Lexer
Initializes a new lexer instance.
    # File 'lib/ebnf/ll1/lexer.rb', line 94

    def initialize(input = nil, terminals = nil, **options)
      @options = options.dup
      @whitespace = @options[:whitespace]
      @terminals = terminals.map do |term|
        if term.is_a?(Array) && term.length == 3
          # Last element is options
          Terminal.new(term[0], term[1], **term[2])
        elsif term.is_a?(Array)
          Terminal.new(*term)
        else
          term
        end
      end

      raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0
      @scanner = Scanner.new(input, **options)
    end
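The terminal forms accepted by the constructor can be sketched as follows, using a stand-in Struct rather than the gem's Terminal class (the names `SketchTerminal` and the terminal values are illustrative only):

```ruby
# Stand-in for EBNF::LL1::Lexer::Terminal, for illustration only.
SketchTerminal = Struct.new(:type, :regexp, :options)

# Terminals may be given as [type, regexp] pairs or
# [type, regexp, options] triples; both are normalized,
# mirroring the branching in #initialize above.
terminals = [
  [:INTEGER, /\d+/],                        # [type, regexp]
  [:STRING, /"[^"]*"/, { unescape: true }]  # [type, regexp, options]
].map do |term|
  if term.is_a?(Array) && term.length == 3
    SketchTerminal.new(term[0], term[1], term[2])
  else
    SketchTerminal.new(*term)
  end
end

terminals.map(&:type)  #=> [:INTEGER, :STRING]
```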
Instance Attribute Details
#input ⇒ String
The current input string being processed.
    # File 'lib/ebnf/ll1/lexer.rb', line 123

    def input
      @input
    end
#options ⇒ Hash (readonly)
Any additional options for the lexer.
    # File 'lib/ebnf/ll1/lexer.rb', line 117

    def options
      @options
    end
#scanner ⇒ StringScanner (readonly, protected)
    # File 'lib/ebnf/ll1/lexer.rb', line 226

    def scanner
      @scanner
    end
#whitespace ⇒ Regexp (readonly)
Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.
    # File 'lib/ebnf/ll1/lexer.rb', line 39

    def whitespace
      @whitespace
    end
Class Method Details
.tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer
Tokenizes the given input string or stream.
    # File 'lib/ebnf/ll1/lexer.rb', line 77

    def self.tokenize(input, terminals, **options, &block)
      lexer = self.new(input, terminals, **options)
      block_given? ? block.call(lexer) : lexer
    end
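The factory pattern here (call the block with the new lexer if one is given, otherwise return the lexer) is a common Ruby idiom. A self-contained sketch with a hypothetical `MiniLexer` stand-in, not part of the gem:

```ruby
# Hypothetical stand-in illustrating the .tokenize factory pattern.
class MiniLexer
  attr_reader :input

  def initialize(input)
    @input = input
  end

  # With a block: call it with the new lexer and return its result.
  # Without a block: return the lexer itself.
  def self.tokenize(input, &block)
    lexer = new(input)
    block_given? ? block.call(lexer) : lexer
  end
end

MiniLexer.tokenize("a b c").input                   #=> "a b c"
MiniLexer.tokenize("a b c") { |l| l.input.upcase }  #=> "A B C"
```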
.unescape_codepoints(string) ⇒ String
Returns a copy of the given input string with all \uXXXX and \UXXXXXXXX Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
    # File 'lib/ebnf/ll1/lexer.rb', line 49

    def self.unescape_codepoints(string)
      ::EBNF::Unescape.unescape_codepoints(string)
    end
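What such codepoint unescaping involves can be sketched with stdlib gsub alone. This is a simplified stand-in, not the gem's implementation (which lives in EBNF::Unescape and handles more cases):

```ruby
# Simplified stand-in for EBNF::Unescape.unescape_codepoints:
# replace \uXXXX and \UXXXXXXXX escapes with the characters they name.
def unescape_codepoints_sketch(string)
  string
    .gsub(/\\U([0-9A-Fa-f]{8})/) { [$1.hex].pack('U') }
    .gsub(/\\u([0-9A-Fa-f]{4})/) { [$1.hex].pack('U') }
end

unescape_codepoints_sketch('caf\u00E9')  #=> "café"
```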
.unescape_string(input) ⇒ String
Returns a copy of the given input string with all string escape sequences (e.g. \n and \t) replaced with their unescaped UTF-8 character counterparts.
    # File 'lib/ebnf/ll1/lexer.rb', line 61

    def self.unescape_string(input)
      ::EBNF::Unescape.unescape_string(input)
    end
Instance Method Details
#each_token {|token| ... } ⇒ Enumerator (also known as: #each)
Enumerates each token in the input string.
    # File 'lib/ebnf/ll1/lexer.rb', line 146

    def each_token(&block)
      if block_given?
        while token = shift
          yield token
        end
      end
      enum_for(:each_token)
    end
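The `enum_for` fallback is the standard Ruby idiom for external iteration: with a block the stream is drained via `#shift`, and an Enumerator over the same method is always returned. A minimal stand-alone illustration (hypothetical `TokenStream`, not the gem's code):

```ruby
# Minimal stand-in showing the each_token idiom: drain via #shift
# when a block is given; return an Enumerator in either case.
class TokenStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def shift
    @tokens.shift
  end

  def each_token(&block)
    if block_given?
      while token = shift
        yield token
      end
    end
    enum_for(:each_token)
  end
end

TokenStream.new(%i[a b c]).each_token.to_a  #=> [:a, :b, :c]
```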
#first(*types) ⇒ Token
Returns the first token in the input stream.
    # File 'lib/ebnf/ll1/lexer.rb', line 161

    def first(*types)
      return nil unless scanner
      @first ||= begin
        {} while !scanner.eos? && skip_whitespace
        return nil if scanner.eos?

        token = match_token(*types)

        if token.nil?
          lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
          raise Error.new("Invalid token #{lexme[0..100].inspect}",
                          input: scanner.rest[0..100], token: lexme, lineno: lineno)
        end
        token
      end
    rescue ArgumentError, Encoding::CompatibilityError => e
      raise Error.new(e.message,
                      input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
    rescue Error
      raise
    rescue
      STDERR.puts "Expected ArgumentError, got #{$!.class}"
      raise
    end
#lineno ⇒ Integer
The current line number (one-based).
    # File 'lib/ebnf/ll1/lexer.rb', line 220

    def lineno
      scanner.lineno
    end
#match_token(*types) ⇒ Token (protected)
Return the matched token.
If the token was matched with a case-insensitive regexp, track this with the resulting Token, so that comparisons with that token are also case insensitive
    # File 'lib/ebnf/ll1/lexer.rb', line 248

    def match_token(*types)
      @terminals.each do |term|
        next unless types.empty? || types.include?(term.type)
        #STDERR.puts "match[#{term.type}] #{scanner.rest[0..100].inspect} against #{term.regexp.inspect}"
        if term.partial_regexp && scanner.match?(term.partial_regexp) &&
           !scanner.match?(term.regexp) &&
           scanner.respond_to?(:ensure_buffer_full)
          scanner.ensure_buffer_full
        end

        if matched = scanner.scan(term.regexp)
          #STDERR.puts "  matched #{term.type.inspect}: #{matched.inspect}"
          tok = token(term.type, term.canonicalize(matched))
          return tok
        end
      end
      nil
    end
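The core of this method, an ordered scan over terminal patterns at the current position, can be modeled with the stdlib StringScanner. This is a simplified sketch (hypothetical terminal table; the real method also handles partial regexps for streaming input and canonicalization):

```ruby
require 'strscan'

# Simplified model of match_token: try each terminal's regexp in
# order at the current scan position; the first match wins.
TERMINALS = [
  [:INTEGER, /\d+/],
  [:IDENT,   /[A-Za-z_]\w*/]
].freeze

def match_token_sketch(scanner)
  TERMINALS.each do |type, regexp|
    if matched = scanner.scan(regexp)
      return [type, matched]
    end
  end
  nil  # no terminal matched at this position
end

scanner = StringScanner.new("42 foo")
match_token_sketch(scanner)  #=> [:INTEGER, "42"]
scanner.skip(/\s+/)
match_token_sketch(scanner)  #=> [:IDENT, "foo"]
```

Note that terminal order matters: an earlier, more general pattern can shadow a later, more specific one, which is why grammars typically list the most specific terminals first.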
#recover(*types) ⇒ Token
Skip input until a token is matched
    # File 'lib/ebnf/ll1/lexer.rb', line 203

    def recover(*types)
      until scanner.eos? || tok = match_token(*types)
        if scanner.skip_until(@whitespace || /\s+/m).nil? # Skip past current "token"
          # No whitespace at the end; must be the end of string
          scanner.terminate
        else
          skip_whitespace
        end
      end
      scanner.unscan if tok
      first
    end
#shift ⇒ Token
Returns the first token and shifts to the next.
    # File 'lib/ebnf/ll1/lexer.rb', line 192

    def shift
      cur = first
      @first = nil
      cur
    end
#skip_whitespace ⇒ Object (protected)
Skip whitespace, as defined through input options or defaults
    # File 'lib/ebnf/ll1/lexer.rb', line 230

    def skip_whitespace
      # skip all white space, but keep track of the current line number
      while @whitespace && !scanner.eos?
        unless scanner.scan(@whitespace)
          return
        end
      end
    end
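A typical `:whitespace` regexp swallows comments as well as blanks, which is what lets terminals stay free of explicit whitespace handling. An illustrative pattern (the actual regexp depends on the grammar being lexed):

```ruby
require 'strscan'

# An illustrative :whitespace pattern that also swallows #-comments.
WS = /(?:\s|#[^\n]*)+/m

scanner = StringScanner.new("  # a comment\n  token")
scanner.scan(WS)   # consumes blanks and the comment line
scanner.rest       #=> "token"
```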
#token(type, value, **options) ⇒ Token (protected)
Constructs a new token object annotated with the current line number.
The parser relies on the type being a symbolized URI and the value being a string, if there is no type. If there is a type, then the value takes on the native representation appropriate for that type.
    # File 'lib/ebnf/ll1/lexer.rb', line 337

    def token(type, value, **options)
      Token.new(type, value, lineno: lineno, **options)
    end
#valid? ⇒ Boolean
Returns true if the input string is lexically valid.
To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.
    # File 'lib/ebnf/ll1/lexer.rb', line 132

    def valid?
      begin
        !count.zero?
      rescue Error
        false
      end
    end