Class: EBNF::LL1::Lexer

Inherits:
Object
Includes:
Unescape, Enumerable
Defined in:
lib/ebnf/ll1/lexer.rb

Overview

A lexical analyzer

Examples:

Tokenizing a Turtle string

terminals = [
  [:BLANK_NODE_LABEL, %r(_:(#{PN_LOCAL}))],
  ...
]
ttl = "@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> ."
lexer = EBNF::LL1::Lexer.tokenize(ttl, terminals)
lexer.each_token do |token|
  puts token.inspect
end

Tokenizing and returning a token stream

lexer = EBNF::LL1::Lexer.tokenize(...)
while some_condition
  token = lexer.first # Get the current token
  token = lexer.shift # Get the current token and shift to the next
end

Handling error conditions

begin
  EBNF::LL1::Lexer.tokenize(query)
rescue EBNF::LL1::Lexer::Error => error
  warn error.inspect
end

Defined Under Namespace

Classes: Error, Terminal, Token

Constant Summary

Constants included from Unescape

Unescape::ECHAR, Unescape::ESCAPE_CHAR4, Unescape::ESCAPE_CHAR8, Unescape::ESCAPE_CHARS, Unescape::UCHAR

Instance Attribute Summary

Class Method Summary

Instance Method Summary

Methods included from Unescape

unescape

Constructor Details

#initialize(input = nil, terminals = nil, **options) ⇒ Lexer

Initializes a new lexer instance.

Parameters:

  • input (String, #to_s) (defaults to: nil)
  • terminals (Array<Array<Symbol, Regexp>, Terminal>) (defaults to: nil)

    Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.

  • options (Hash{Symbol => Object})

    a customizable set of options

Options Hash (**options):

  • :whitespace (Regexp)

    Whitespace between tokens, including comments

Raises:

  • (Error)

    if no terminal patterns are defined

# File 'lib/ebnf/ll1/lexer.rb', line 94

def initialize(input = nil, terminals = nil, **options)
  @options        = options.dup
  @whitespace     = @options[:whitespace]
  @terminals      = terminals.map do |term|
    if term.is_a?(Array) && term.length == 3
      # Last element is options
      Terminal.new(term[0], term[1], **term[2])
    elsif term.is_a?(Array)
      Terminal.new(*term)
    else
      term
    end
  end

  raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0

  @scanner = Scanner.new(input, **options)
end
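
A minimal usage sketch; the terminal names and patterns here are illustrative, and the :unescape option is an assumption about what Terminal.new accepts:

require 'ebnf'

terminals = [
  [:INTEGER, /[0-9]+/],                    # [type, regexp] pair
  [:STRING,  /"[^"]*"/, {unescape: true}]  # [type, regexp, options]; option is assumed
]
lexer = EBNF::LL1::Lexer.new('123 "abc"', terminals, whitespace: /\s+/)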

Instance Attribute Details

#input ⇒ String

The current input string being processed.

Returns:

  • (String)


# File 'lib/ebnf/ll1/lexer.rb', line 123

def input
  @input
end

#options ⇒ Hash (readonly)

Any additional options for the lexer.

Returns:

  • (Hash)


# File 'lib/ebnf/ll1/lexer.rb', line 117

def options
  @options
end

#scanner ⇒ StringScanner (readonly, protected)

Returns:

  • (StringScanner)


# File 'lib/ebnf/ll1/lexer.rb', line 226

def scanner
  @scanner
end

#whitespace ⇒ Regexp (readonly)

The pattern defining whitespace between tokens, including comments; if unset, whitespace must be matched explicitly by the terminals.

Returns:

  • (Regexp)

    the whitespace pattern, including comments; otherwise whitespace must be explicit in terminals



# File 'lib/ebnf/ll1/lexer.rb', line 39

def whitespace
  @whitespace
end

Class Method Details

.tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer

Tokenizes the given input string or stream.

Parameters:

  • input (String, #to_s)
  • terminals (Array<Array<Symbol, Regexp>>)

    Array of symbol, regexp pairs used to match terminals. If the symbol is nil, it defines a Regexp to match string terminals.

  • options (Hash{Symbol => Object})

Yields:

  • (lexer)

Yield Parameters:

  • lexer (Lexer)

Returns:

  • (Lexer)

Raises:

  • (Error)

    on invalid input


# File 'lib/ebnf/ll1/lexer.rb', line 77

def self.tokenize(input, terminals, **options, &block)
  lexer = self.new(input, terminals, **options)
  block_given? ? block.call(lexer) : lexer
end
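
Note from the source that the block form returns the block's result rather than the lexer itself. A short sketch, using the terminals from the constructor example above:

EBNF::LL1::Lexer.tokenize('123 "abc"', terminals) do |lexer|
  lexer.each_token { |token| puts token.inspect }
end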

.unescape_codepoints(string) ⇒ String

Returns a copy of the given input string with all \uXXXX and \UXXXXXXXX Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.

Parameters:

  • string (String)

Returns:

  • (String)



# File 'lib/ebnf/ll1/lexer.rb', line 49

def self.unescape_codepoints(string)
  ::EBNF::Unescape.unescape_codepoints(string)
end
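
For example, \u0041 and \u0042 are the codepoints for "A" and "B":

EBNF::LL1::Lexer.unescape_codepoints("\\u0041\\u0042") #=> "AB"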

.unescape_string(input) ⇒ String

Returns a copy of the given input string with all string escape sequences (e.g. \n and \t) replaced with their unescaped UTF-8 character counterparts.

Parameters:

  • input (String)

Returns:

  • (String)



# File 'lib/ebnf/ll1/lexer.rb', line 61

def self.unescape_string(input)
  ::EBNF::Unescape.unescape_string(input)
end
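
For example, a literal backslash-t sequence becomes a tab character:

EBNF::LL1::Lexer.unescape_string("a\\tb") #=> "a\tb"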

Instance Method Details

#each_token {|token| ... } ⇒ Enumerator

Also known as: each

Enumerates each token in the input string.

Yields:

  • (token)

Yield Parameters:

  • token (Token)
Returns:

  • (Enumerator)


# File 'lib/ebnf/ll1/lexer.rb', line 146

def each_token(&block)
  if block_given?
    while token = shift
      yield token
    end
  end
  enum_for(:each_token)
end
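
A sketch of block iteration (terminal definitions illustrative):

lexer = EBNF::LL1::Lexer.new('1 22 333', [[:INTEGER, /[0-9]+/]], whitespace: /\s+/)
lexer.each_token do |token|
  puts "#{token.type}: #{token.value.inspect}"
end
# Prints INTEGER: "1", INTEGER: "22", INTEGER: "333" on successive lines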

#first(*types) ⇒ Token

Returns the first token in the input stream without consuming it.

Parameters:

  • types (Array[Symbol])

    Optional set of types for restricting terminals examined

Returns:

  • (Token)

# File 'lib/ebnf/ll1/lexer.rb', line 161

def first(*types)
  return nil unless scanner

  @first ||= begin
    {} while !scanner.eos? && skip_whitespace
    return nil if scanner.eos?

    token = match_token(*types)

    if token.nil?
      lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
      raise Error.new("Invalid token #{lexme[0..100].inspect}",
        input: scanner.rest[0..100], token: lexme, lineno: lineno)
    end

    token
  end
rescue ArgumentError, Encoding::CompatibilityError => e
  raise Error.new(e.message,
    input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
rescue Error
  raise
rescue
  STDERR.puts "Expected ArgumentError, got #{$!.class}"
  raise
end
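
Passing types restricts matching to those terminals. As the source shows, the result is cached until #shift is called, so repeated calls peek at the same token:

token = lexer.first(:INTEGER)  # only the (illustrative) :INTEGER terminal is tried
token = lexer.first            # returns the same cached token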

#lineno ⇒ Integer

The current line number (one-based).

Returns:

  • (Integer)


# File 'lib/ebnf/ll1/lexer.rb', line 220

def lineno
  scanner.lineno
end

#match_token(*types) ⇒ Token (protected)

Return the matched token.

If the token was matched with a case-insensitive regexp, track this with the resulting Token, so that comparisons with that token are also case insensitive

Parameters:

  • types (Array[Symbol])

    Optional set of types for restricting terminals examined

Returns:

  • (Token)

# File 'lib/ebnf/ll1/lexer.rb', line 248

def match_token(*types)
  @terminals.each do |term|
    next unless types.empty? || types.include?(term.type)
    #STDERR.puts "match[#{term.type}] #{scanner.rest[0..100].inspect} against #{term.regexp.inspect}" #if term.type == :STRING_LITERAL_SINGLE_QUOTE
    if term.partial_regexp && scanner.match?(term.partial_regexp) && !scanner.match?(term.regexp) && scanner.respond_to?(:ensure_buffer_full)
      scanner.ensure_buffer_full
    end

    if matched = scanner.scan(term.regexp)
      #STDERR.puts "  matched #{term.type.inspect}: #{matched.inspect}"
      tok = token(term.type, term.canonicalize(matched))
      return tok
    end
  end
  nil
end

#recover(*types) ⇒ Token

Skips input until a token is matched.

Parameters:

  • types (Array[Symbol])

    Optional set of types for restricting terminals examined

Returns:

  • (Token)

# File 'lib/ebnf/ll1/lexer.rb', line 203

def recover(*types)
  until scanner.eos? || (tok = match_token(*types))
    if scanner.skip_until(@whitespace || /\s+/m).nil? # Skip past current "token"
      # No whitespace at the end, must be at end of string
      scanner.terminate
    else
      skip_whitespace
    end
  end
  scanner.unscan if tok
  first
end
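
A hedged sketch of a recovery loop (this assumes Error exposes the lineno passed to its constructor):

begin
  token = lexer.shift
rescue EBNF::LL1::Lexer::Error => e
  warn "skipping invalid input at line #{e.lineno}"
  token = lexer.recover  # skip ahead to the next recognizable token
end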

#shift ⇒ Token

Returns the first token and shifts to the next.

Returns:

  • (Token)

# File 'lib/ebnf/ll1/lexer.rb', line 192

def shift
  cur = first
  @first = nil
  cur
end
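
Since #first peeks and #shift consumes, a simple driver loop reads:

while (token = lexer.shift)  # nil once the input is exhausted
  puts token.inspect
end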

#skip_whitespace ⇒ Object (protected)

Skip whitespace, as defined through input options or defaults



# File 'lib/ebnf/ll1/lexer.rb', line 230

def skip_whitespace
  # skip all white space, but keep track of the current line number
  while @whitespace && !scanner.eos?
    unless scanner.scan(@whitespace)
      return
    end
  end
end

#token(type, value, **options) ⇒ Token (protected)

Constructs a new token object annotated with the current line number.

If there is no type, the parser relies on the type being a symbolized URI and the value being a string. If there is a type, the value takes on the native representation appropriate for that type.

Parameters:

  • type (Symbol)
  • value (String)

    Scanner instance with access to matched groups

  • options (Hash{Symbol => Object})

Returns:

  • (Token)

# File 'lib/ebnf/ll1/lexer.rb', line 337

def token(type, value, **options)
  Token.new(type, value, lineno: lineno, **options)
end

#valid? ⇒ Boolean

Returns true if the input string is lexically valid.

To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.

Returns:

  • (Boolean)


# File 'lib/ebnf/ll1/lexer.rb', line 132

def valid?
  begin
    !count.zero?
  rescue Error
    false
  end
end
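
For example, on fresh lexers over valid and invalid input (terminals illustrative):

terminals = [[:INTEGER, /[0-9]+/]]
EBNF::LL1::Lexer.new('1 2 3', terminals, whitespace: /\s+/).valid?  #=> true
EBNF::LL1::Lexer.new('1 %%%', terminals, whitespace: /\s+/).valid?  #=> false ("%%%" matches no terminal)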