Description
Our application - https://github.com/bdring/Grbl_Esp32 - works fine on v1.0.4, but loses Serial Rx bytes on every version after that, starting at 1.0.5-rc1. In developer testing, the loss of data typically occurs about 15 minutes into a 20 minute test run involving sending a 600KB GCode file sent via serial from third-party GCode sending programs that are very well tested. Multiple users have reported much quicker failures.
When we or our users compile against v1.0.4, the problems disappear. The problems appear when compiling against any variant of 1.0.5 or 1.0.6. It is the same with Arduino IDE and PlatformIO - if the Arduino ESP32 framework is v1.0.4, it works, and any later version fails.
The nature of the problem is a lost byte. In every instance that we have analyzed in depth, the lost byte is the newline at the end of a line.
I have isolated the cause of the problem to the rx portion of the patch in #3664 . The attempted fix to that patch - #3713 - does not solve the problem. The 3713 fix changes the failure mode, by eliminating the interchange of two characters at the expense of somewhat-less-frequent lost bytes.
The smallest fix that I have found is to simply revert this change from the 3664 patch:
- return uxQueueMessagesWaiting(uart->queue);
+ return (uxQueueMessagesWaiting(uart->queue) + uart->dev->status.rxfifo_cnt) ;
That reversion suffices to make our application work reliably, but I do not believe that it is completely correct. I think that all of the code that uses uartRxFifoToQueue() should also be removed. Reaching around the ISR routine from uartRead() and uartPeek() - even when interrupts are disabled temporarily - is a recipe for obscure bugs, which we are experiencing. I think that the reverted return line has the effect of reducing the likelihood that uartRxFifoToQueue() is called, thus reducing, but probably not entirely eliminating, the possibility of lost data.
PR 3664 was intended to reduce Rx latency. I think a much better solution to that goal would be to make the FIFO and TOUT thresholds configurable, preserving the classic driver architecture where the ISR is the only writer to the queue.
I can produce a PR but I first wanted to document and explain the problem.