Class: EBNF::LL1::Lexer
- Inherits: Object
- Includes: Unescape, Enumerable
- Defined in: lib/ebnf/ll1/lexer.rb
Overview
A lexical analyzer
Defined Under Namespace
Classes: Error, Terminal, Token
Constant Summary
Constants included from Unescape
Unescape::ECHAR, Unescape::ESCAPE_CHAR4, Unescape::ESCAPE_CHAR8, Unescape::ESCAPE_CHARS, Unescape::UCHAR
Instance Attribute Summary
- #input ⇒ String
  The current input string being processed.
- #options ⇒ Hash (readonly)
  Any additional options for the lexer.
- #scanner ⇒ StringScanner (readonly, protected)
- #whitespace ⇒ Regexp (readonly)
  Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.
Class Method Summary
- .tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer
  Tokenizes the given input string or stream.
- .unescape_codepoints(string) ⇒ String
  Returns a copy of the given input string with all \uXXXX and \UXXXXXXXX Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
- .unescape_string(input) ⇒ String
  Returns a copy of the given input string with all string escape sequences (e.g. \n and \t) replaced with their unescaped UTF-8 character counterparts.
Instance Method Summary
- #each_token {|token| ... } ⇒ Enumerator (also: #each)
  Enumerates each token in the input string.
- #first(*types) ⇒ Token
  Returns the first token in the input stream.
- #initialize(input = nil, terminals = nil, **options) ⇒ Lexer (constructor)
  Initializes a new lexer instance.
- #lineno ⇒ Integer
  The current line number (one-based).
- #match_token(*types) ⇒ Token (protected)
  Returns the matched token.
- #recover(*types) ⇒ Token
  Skips input until a token is matched.
- #shift ⇒ Token
  Returns the first token and shifts to the next.
- #skip_whitespace ⇒ Object (protected)
  Skips whitespace, as defined through input options or defaults.
- #token(type, value, **options) ⇒ Token (protected)
  Constructs a new token object annotated with the current line number.
- #valid? ⇒ Boolean
  Returns true if the input string is lexically valid.
Methods included from Unescape
Constructor Details
#initialize(input = nil, terminals = nil, **options) ⇒ Lexer
Initializes a new lexer instance.
    # File 'lib/ebnf/ll1/lexer.rb', line 94

    def initialize(input = nil, terminals = nil, **options)
      @options = options.dup
      @whitespace = @options[:whitespace]
      @terminals = terminals.map do |term|
        if term.is_a?(Array) && term.length == 3
          # Last element is options
          Terminal.new(term[0], term[1], **term[2])
        elsif term.is_a?(Array)
          Terminal.new(*term)
        else
          term
        end
      end

      raise Error, "Terminal patterns not defined" unless @terminals && @terminals.length > 0
      @scanner = Scanner.new(input, **options)
    end
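The terminal forms accepted by the constructor can be sketched as follows, using a stand-in Struct rather than the gem's Terminal class (the names `SketchTerminal` and the terminal values are illustrative only):

```ruby
# Stand-in for EBNF::LL1::Lexer::Terminal, for illustration only.
SketchTerminal = Struct.new(:type, :regexp, :options)

# Terminals may be given as [type, regexp] pairs or
# [type, regexp, options] triples; both are normalized,
# mirroring the branching in #initialize above.
terminals = [
  [:INTEGER, /\d+/],                        # [type, regexp]
  [:STRING, /"[^"]*"/, { unescape: true }]  # [type, regexp, options]
].map do |term|
  if term.is_a?(Array) && term.length == 3
    SketchTerminal.new(term[0], term[1], term[2])
  else
    SketchTerminal.new(*term)
  end
end

terminals.map(&:type)  #=> [:INTEGER, :STRING]
```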
Instance Attribute Details
#input ⇒ String
The current input string being processed.
    # File 'lib/ebnf/ll1/lexer.rb', line 123

    def input
      @input
    end
#options ⇒ Hash (readonly)
Any additional options for the lexer.
    # File 'lib/ebnf/ll1/lexer.rb', line 117

    def options
      @options
    end
#scanner ⇒ StringScanner (readonly, protected)
    # File 'lib/ebnf/ll1/lexer.rb', line 226

    def scanner
      @scanner
    end
#whitespace ⇒ Regexp (readonly)
Defines whitespace, including comments; otherwise whitespace must be explicit in terminals.
    # File 'lib/ebnf/ll1/lexer.rb', line 39

    def whitespace
      @whitespace
    end
Class Method Details
.tokenize(input, terminals, **options) {|lexer| ... } ⇒ Lexer
Tokenizes the given input string or stream.
    # File 'lib/ebnf/ll1/lexer.rb', line 77

    def self.tokenize(input, terminals, **options, &block)
      lexer = self.new(input, terminals, **options)
      block_given? ? block.call(lexer) : lexer
    end
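The factory pattern here (call the block with the new lexer if one is given, otherwise return the lexer) is a common Ruby idiom. A self-contained sketch with a hypothetical `MiniLexer` stand-in, not part of the gem:

```ruby
# Hypothetical stand-in illustrating the .tokenize factory pattern.
class MiniLexer
  attr_reader :input

  def initialize(input)
    @input = input
  end

  # With a block: call it with the new lexer and return its result.
  # Without a block: return the lexer itself.
  def self.tokenize(input, &block)
    lexer = new(input)
    block_given? ? block.call(lexer) : lexer
  end
end

MiniLexer.tokenize("a b c").input                   #=> "a b c"
MiniLexer.tokenize("a b c") { |l| l.input.upcase }  #=> "A B C"
```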
.unescape_codepoints(string) ⇒ String
Returns a copy of the given input string with all \uXXXX and \UXXXXXXXX Unicode codepoint escape sequences replaced with their unescaped UTF-8 character counterparts.
    # File 'lib/ebnf/ll1/lexer.rb', line 49

    def self.unescape_codepoints(string)
      ::EBNF::Unescape.unescape_codepoints(string)
    end
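What such codepoint unescaping involves can be sketched with stdlib gsub alone. This is a simplified stand-in, not the gem's implementation (which lives in EBNF::Unescape and handles more cases):

```ruby
# Simplified stand-in for EBNF::Unescape.unescape_codepoints:
# replace \uXXXX and \UXXXXXXXX escapes with the characters they name.
def unescape_codepoints_sketch(string)
  string
    .gsub(/\\U([0-9A-Fa-f]{8})/) { [$1.hex].pack('U') }
    .gsub(/\\u([0-9A-Fa-f]{4})/) { [$1.hex].pack('U') }
end

unescape_codepoints_sketch('caf\u00E9')  #=> "café"
```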
.unescape_string(input) ⇒ String
Returns a copy of the given input string with all string escape sequences (e.g. \n and \t) replaced with their unescaped UTF-8 character counterparts.
    # File 'lib/ebnf/ll1/lexer.rb', line 61

    def self.unescape_string(input)
      ::EBNF::Unescape.unescape_string(input)
    end
Instance Method Details
#each_token {|token| ... } ⇒ Enumerator (also known as: #each)
Enumerates each token in the input string.
    # File 'lib/ebnf/ll1/lexer.rb', line 146

    def each_token(&block)
      if block_given?
        while token = shift
          yield token
        end
      end
      enum_for(:each_token)
    end
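The `enum_for` fallback is the standard Ruby idiom for external iteration: with a block the stream is drained via `#shift`, and an Enumerator over the same method is always returned. A minimal stand-alone illustration (hypothetical `TokenStream`, not the gem's code):

```ruby
# Minimal stand-in showing the each_token idiom: drain via #shift
# when a block is given; return an Enumerator in either case.
class TokenStream
  def initialize(tokens)
    @tokens = tokens.dup
  end

  def shift
    @tokens.shift
  end

  def each_token(&block)
    if block_given?
      while token = shift
        yield token
      end
    end
    enum_for(:each_token)
  end
end

TokenStream.new(%i[a b c]).each_token.to_a  #=> [:a, :b, :c]
```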
#first(*types) ⇒ Token
Returns the first token in the input stream.
    # File 'lib/ebnf/ll1/lexer.rb', line 161

    def first(*types)
      return nil unless scanner
      @first ||= begin
        {} while !scanner.eos? && skip_whitespace
        return nil if scanner.eos?

        token = match_token(*types)

        if token.nil?
          lexme = (scanner.rest.split(@whitespace || /\s/).first rescue nil) || scanner.rest
          raise Error.new("Invalid token #{lexme[0..100].inspect}",
                          input: scanner.rest[0..100], token: lexme, lineno: lineno)
        end
        token
      end
    rescue ArgumentError, Encoding::CompatibilityError => e
      raise Error.new(e.message,
                      input: (scanner.rest[0..100] rescue '??'), token: lexme, lineno: lineno)
    rescue Error
      raise
    rescue
      STDERR.puts "Expected ArgumentError, got #{$!.class}"
      raise
    end
#lineno ⇒ Integer
The current line number (one-based).
    # File 'lib/ebnf/ll1/lexer.rb', line 220

    def lineno
      scanner.lineno
    end
#match_token(*types) ⇒ Token (protected)
Return the matched token.
If the token was matched with a case-insensitive regexp, track this with the resulting Token, so that comparisons with that token are also case insensitive
    # File 'lib/ebnf/ll1/lexer.rb', line 248

    def match_token(*types)
      @terminals.each do |term|
        next unless types.empty? || types.include?(term.type)
        #STDERR.puts "match[#{term.type}] #{scanner.rest[0..100].inspect} against #{term.regexp.inspect}"
        if term.partial_regexp && scanner.match?(term.partial_regexp) &&
           !scanner.match?(term.regexp) &&
           scanner.respond_to?(:ensure_buffer_full)
          scanner.ensure_buffer_full
        end

        if matched = scanner.scan(term.regexp)
          #STDERR.puts "  matched #{term.type.inspect}: #{matched.inspect}"
          tok = token(term.type, term.canonicalize(matched))
          return tok
        end
      end
      nil
    end
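The core of this method, an ordered scan over terminal patterns at the current position, can be modeled with the stdlib StringScanner. This is a simplified sketch (hypothetical terminal table; the real method also handles partial regexps for streaming input and canonicalization):

```ruby
require 'strscan'

# Simplified model of match_token: try each terminal's regexp in
# order at the current scan position; the first match wins.
TERMINALS = [
  [:INTEGER, /\d+/],
  [:IDENT,   /[A-Za-z_]\w*/]
].freeze

def match_token_sketch(scanner)
  TERMINALS.each do |type, regexp|
    if matched = scanner.scan(regexp)
      return [type, matched]
    end
  end
  nil  # no terminal matched at this position
end

scanner = StringScanner.new("42 foo")
match_token_sketch(scanner)  #=> [:INTEGER, "42"]
scanner.skip(/\s+/)
match_token_sketch(scanner)  #=> [:IDENT, "foo"]
```

Note that terminal order matters: an earlier, more general pattern can shadow a later, more specific one, which is why grammars typically list the most specific terminals first.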
#recover(*types) ⇒ Token
Skip input until a token is matched
    # File 'lib/ebnf/ll1/lexer.rb', line 203

    def recover(*types)
      until scanner.eos? || tok = match_token(*types)
        if scanner.skip_until(@whitespace || /\s+/m).nil? # Skip past current "token"
          # No whitespace at the end; must be the end of string
          scanner.terminate
        else
          skip_whitespace
        end
      end
      scanner.unscan if tok
      first
    end
#shift ⇒ Token
Returns the first token and shifts to the next.
    # File 'lib/ebnf/ll1/lexer.rb', line 192

    def shift
      cur = first
      @first = nil
      cur
    end
#skip_whitespace ⇒ Object (protected)
Skip whitespace, as defined through input options or defaults
    # File 'lib/ebnf/ll1/lexer.rb', line 230

    def skip_whitespace
      # skip all white space, but keep track of the current line number
      while @whitespace && !scanner.eos?
        unless scanner.scan(@whitespace)
          return
        end
      end
    end
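A typical `:whitespace` regexp swallows comments as well as blanks, which is what lets terminals stay free of explicit whitespace handling. An illustrative pattern (the actual regexp depends on the grammar being lexed):

```ruby
require 'strscan'

# An illustrative :whitespace pattern that also swallows #-comments.
WS = /(?:\s|#[^\n]*)+/m

scanner = StringScanner.new("  # a comment\n  token")
scanner.scan(WS)   # consumes blanks and the comment line
scanner.rest       #=> "token"
```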
#token(type, value, **options) ⇒ Token (protected)
Constructs a new token object annotated with the current line number.
The parser relies on the type being a symbolized URI and the value being a string, if there is no type. If there is a type, then the value takes on the native representation appropriate for that type.
    # File 'lib/ebnf/ll1/lexer.rb', line 337

    def token(type, value, **options)
      Token.new(type, value, lineno: lineno, **options)
    end
#valid? ⇒ Boolean
Returns true if the input string is lexically valid.
To be considered valid, the input string must contain more than zero terminals, and must not contain any invalid terminals.
    # File 'lib/ebnf/ll1/lexer.rb', line 132

    def valid?
      begin
        !count.zero?
      rescue Error
        false
      end
    end