Unclosed string literal not properly handled by StdLexical #397

Closed
@martingd

In scala.util.parsing.combinator.lexical.StdLexical, the handling of unclosed (unterminated) string literals does not seem to work as expected.

The token parser declared at the top of the StdLexical source looks like this. It is supposed to handle unclosed string literals by returning the value ErrorToken(unclosed string literal):

def token: Parser[Token] =
  ( identChar ~ rep( identChar | digit )              ^^ { case first ~ rest => processIdent(first :: rest mkString "") }
  | digit ~ rep( digit )                              ^^ { case first ~ rest => NumericLit(first :: rest mkString "") }
  | '\'' ~ rep( chrExcept('\'', '\n', EofCh) ) ~ '\'' ^^ { case '\'' ~ chars ~ '\'' => StringLit(chars mkString "") }
  | '\"' ~ rep( chrExcept('\"', '\n', EofCh) ) ~ '\"' ^^ { case '\"' ~ chars ~ '\"' => StringLit(chars mkString "") }
  | EofCh                                             ^^^ EOF
  | '\'' ~> failure("unclosed string literal")
  | '\"' ~> failure("unclosed string literal")
  | delim
  | failure("illegal character")
  )

Here is a simple setup trying to use StdLexical:

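import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.input.Reader

object Lexer extends App {
    // Print every token the StdLexical scanner produces for the input.
    def lex(input: String) = {
        val lexer = new StdLexical
        var scanner: Reader[lexer.Token] = new lexer.Scanner(input)
        while (!scanner.atEnd) {
            println(scanner.first)
            scanner = scanner.rest
        }
    }
}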

Now, calling that with legal input works, here recognising an identifier and a string literal:

> lex(""" hello "world" """)
identifier hello
"world"

Passing an illegal character also works as expected:

> lex(""" hello € "world" """)
identifier hello
ErrorToken(illegal character)
"world"

However, the rule for an unterminated double-quoted (or single-quoted) string does not seem to work: the lexer produces ErrorToken(end of input) instead of the expected ErrorToken(unclosed string literal):

> lex(""" hello € "unterminated """)
identifier hello
ErrorToken(illegal character)
ErrorToken(end of input)

My first guess was that the rules for unterminated strings use the failure parser, which allows backtracking. But backtracking should have taken us to the final failure("illegal character") alternative, not to "end of input". In any case, neither inserting a cut (~!) nor using err instead of failure fixes the issue.
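One possible reading of this behaviour, offered as an assumption rather than a diagnosis confirmed in this issue: when all alternatives of | fail, the combinator seems to keep the failure that got furthest into the input. The closed-string rule fails at the end of the text, while the "unclosed string literal" rule fails right after the opening quote, so the former would win. The sketch below probes that assumption with two hypothetical parsers (closed and unclosed) built on RegexParsers rather than on StdLexical itself.

```scala
import scala.util.parsing.combinator.RegexParsers

// Hypothetical probe; FailureChoice, closed, unclosed and report are
// illustrative names and not part of StdLexical.
object FailureChoice extends RegexParsers {
  override val skipWhitespace = false

  // Fails at the end of the input when the closing quote is missing.
  val closed: Parser[String] = "\"" ~> "[^\"]*".r <~ "\""

  // Fails right after the opening quote, with the message we would like to see.
  val unclosed: Parser[String] = "\"" ~> failure("unclosed string literal")

  val tok: Parser[String] = closed | unclosed

  // Returns the failure message (or "success") and the offset it was reported at.
  def report(input: String): (String, Int) = parse(tok, input) match {
    case Success(_, next) => ("success", next.offset)
    case ns: NoSuccess    => (ns.msg, ns.next.offset)
  }

  def main(args: Array[String]): Unit = {
    println(report("\"abc")) // opening quote followed by more characters
    println(report("\""))    // opening quote at the very end of the input
  }
}
```

If the assumption holds, the first call reports a failure at the end of the input with a message other than "unclosed string literal", and only the second call (where both alternatives fail at the same position) reports "unclosed string literal".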

EDIT

Parsing a string with a lone double-quote character at the very end returns the expected token:

> lex(""" hello € """")
identifier hello
ErrorToken(illegal character)
ErrorToken(unclosed string literal)

But adding any character to the unclosed string literal causes ErrorToken(end of input) to be emitted.
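Until the library behaviour is clarified, one possible workaround is to recognise the unclosed literal with a parser that succeeds and returns the ErrorToken itself, instead of relying on failure. This is a minimal sketch, not a fix to StdLexical: PatchedLexical, unclosedString and PatchedLexer are hypothetical names, and the choice to try the new rule before super.token is an assumption about what is wanted.

```scala
import scala.util.parsing.combinator.lexical.StdLexical
import scala.util.parsing.input.CharArrayReader.EofCh
import scala.util.parsing.input.Reader

// Hypothetical subclass; not part of scala-parser-combinators.
class PatchedLexical extends StdLexical {
  // Succeeds with an ErrorToken when a quote is opened but the run of
  // ordinary characters is not followed by the matching closing quote.
  private def unclosedString(quote: Char): Parser[Token] =
    (elem(quote) ~ rep(chrExcept(quote, '\n', EofCh)) ~ not(elem(quote))) ^^^
      ErrorToken("unclosed string literal")

  // Properly closed strings make unclosedString fail, so they still
  // fall through to the original rules in super.token.
  override def token: Parser[Token] =
    unclosedString('\'') | unclosedString('\"') | super.token
}

object PatchedLexer {
  // Collects the printed form of every token, mirroring lex above.
  def tokens(input: String): List[String] = {
    val lexer = new PatchedLexical
    var scanner: Reader[lexer.Token] = new lexer.Scanner(input)
    val out = List.newBuilder[String]
    while (!scanner.atEnd) {
      out += scanner.first.toString
      scanner = scanner.rest
    }
    out.result()
  }
}
```

With this subclass, PatchedLexer.tokens(""" hello € "unterminated """) should end with ErrorToken(unclosed string literal), while legal input such as """ hello "world" """ should lex as before.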
