Description
In scala.util.parsing.combinator.lexical.StdLexical, the handling of unclosed (unterminated) string literals does not seem to work as expected.
The token parser declared at the top of that class looks like this, and it is supposed to handle an unclosed string literal by producing ErrorToken(unclosed string literal):
def token: Parser[Token] =
  ( identChar ~ rep( identChar | digit ) ^^ { case first ~ rest => processIdent(first :: rest mkString "") }
  | digit ~ rep( digit ) ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }
  | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }
  | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }
  | EofCh ^^^ EOF
  | '\'' ~> failure("unclosed string literal")
  | '\"' ~> failure("unclosed string literal")
  | delim
  | failure("illegal character")
  )
Here is a simple setup that uses StdLexical:
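import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.input.Reader

object Lexer extends App {
  // Print every token the scanner produces for the given input.
  def lex(input: String) = {
    val lexer = new StdLexical
    var scanner: Reader[lexer.Token] = new lexer.Scanner(input)
    while (!scanner.atEnd) {
      println(scanner.first)
      scanner = scanner.rest
    }
  }
}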
Now, calling that with legal input works, here recognising an identifier and a string literal:
> lex(""" hello "world" """)
identifier hello
"world"
Passing an illegal character also works as expected:
> lex(""" hello € "world" """)
identifier hello
ErrorToken(illegal character)
"world"
However, the rule for an unterminated double-quoted (or single-quoted) string does not seem to work: the lexer produces ErrorToken(end of input) instead of the expected ErrorToken(unclosed string literal):
> lex(""" hello € "unterminated """)
identifier hello
ErrorToken(illegal character)
ErrorToken(end of input)
My first guess was that the problem is that the rules for unterminated strings use the failure parser, which allows backtracking. But backtracking should then have taken us to the final failure("illegal character") alternative, not to "end of input". Also, inserting a cut (~!) or using err instead of failure does not fix the issue.
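For reference, the err variant mentioned above might look roughly like the following sketch. The names ErrLexical, ErrVariant and lastToken are made up for illustration; the rules are copied from the token parser quoted earlier, with only the two unclosed-string alternatives changed. It assumes the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.input.CharArrayReader.EofCh
import scala.util.parsing.input.Reader

// Hypothetical subclass (not part of the library): same rules as StdLexical.token,
// except that the two unclosed-string alternatives use err instead of failure.
class ErrLexical extends StdLexical {
  override def token: Parser[Token] =
    ( identChar ~ rep(identChar | digit) ^^ { case first ~ rest => processIdent((first :: rest).mkString) }
    | digit ~ rep(digit) ^^ { case first ~ rest => NumericLit((first :: rest).mkString) }
    | '\'' ~ rep(chrExcept('\'', '\n', EofCh)) ~ '\'' ^^ { case _ ~ chars ~ _ => StringLit(chars.mkString) }
    | '\"' ~ rep(chrExcept('\"', '\n', EofCh)) ~ '\"' ^^ { case _ ~ chars ~ _ => StringLit(chars.mkString) }
    | EofCh ^^^ EOF
    | '\'' ~> err("unclosed string literal")
    | '\"' ~> err("unclosed string literal")
    | delim
    | failure("illegal character")
    )
}

object ErrVariant {
  // Hypothetical helper: string form of the last token the scanner produces for `input`.
  def lastToken(input: String): String = {
    val lexer = new ErrLexical
    var scanner: Reader[lexer.Token] = new lexer.Scanner(input)
    var last = ""
    while (!scanner.atEnd) {
      last = scanner.first.toString
      scanner = scanner.rest
    }
    last
  }
}
```

In line with the observation above, calling ErrVariant.lastToken(" hello € \"unterminated ") would still be expected to yield ErrorToken(end of input) rather than the unclosed-string message.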
EDIT
Parsing a string whose very last character is a lone double-quote character returns the expected token:
> lex(""" hello € """")
identifier hello
ErrorToken(illegal character)
ErrorToken(unclosed string literal)
But adding any character to the unclosed string literal causes ErrorToken(end of input) to be emitted instead.
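To compare the two cases side by side, the tokens can be collected into a list instead of printed. This is only a sketch of the setup shown earlier; the names LexCompare and tokens are made up here, and it assumes the scala-parser-combinators module is on the classpath.

```scala
import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.input.Reader

object LexCompare {
  // Hypothetical helper: collect the string form of every token instead of printing it.
  def tokens(input: String): List[String] = {
    val lexer = new StdLexical
    val out = List.newBuilder[String]
    var scanner: Reader[lexer.Token] = new lexer.Scanner(input)
    while (!scanner.atEnd) {
      out += scanner.first.toString
      scanner = scanner.rest
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    println(tokens(" hello € \""))              // the quote is the last character
    println(tokens(" hello € \"unterminated ")) // characters follow the quote
  }
}
```

The only difference between the two inputs is whether anything follows the opening quote, which makes it easy to check when the message switches from unclosed string literal to end of input.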