Skip to content

Recursion limits hit more often in PHP 7.3+ #333

Open
@michaelbutler

Description

@michaelbutler

With the major release of PHP 7.3 there were some big changes under the hood with the regular expression engine, going from PCRE to PCRE2. While it is not explicitly called out in the changes, I noticed that this new PCRE2 engine is a lot more susceptible to the PREG_RECURSION_LIMIT_ERROR error when using MarkdownExtra. Let's dig in to the details.

pcre.recursion_limit is the INI setting that controls:

(Default 100000) PCRE's recursion limit. Please note that if you set this value to a high number you may consume all the available process stack and eventually crash PHP (due to reaching the stack size limit imposed by the Operating System).

In PHP 7.2 and below, we had actually decreased the 100,000 default to 10,000 for performance reasons (to stop out-of-control user content from slowing down PHP). But in PHP 7.3, having this 10,000 number now fails on pieces of content that did not fail before. This led me to think that PHP 7.3 is more susceptible to this.

What happens when the recursion limit is hit?

Simple, the preg_split call returns false, which this library doesn't expect, so the code fails at a later line due to unexpected data type (false is not Countable, also a fairly recent & strict PHP error).

Example Content

But what specifically is in the content that causes us to hit this limit? It turns out that it has to do with dangling < in the markdown -- for example, <3 for a heart symbol, or doing a math comparison like 4 < 8 or really any other < that is not part of an HTML tag. That alone isn't a problem, you also need to have "a lot" of content after the <. I tested truncating the markdown to various lengths and it wouldn't fail until it was sufficiently long enough. For the INI setting of 10,000 this failed at a document length of around 5k chars where the <3 was near the top. In fact I'll attach the file here testpost.txt. to test, run with:

echo "<?php  \Michelf\MarkdownExtra::defaultTransform(file_get_contents('testpost.txt'));" > testing.php;
php -dauto_prepend_file=vendor/autoload.php -dpcre.recursion_limit=10000 testing.php

If I understand the code correctly, it seems that it is trying to find HTML tags by looking for the starting < and then trying to find the ending >. But what if there never is an ending >? That seems to be our problem. I noticed that if I preprocessed the file to replace <3 with &lt;3, the problem went away and we did not hit the recursion limit error.

Solutions

So what can we do here? For starters as a bare minimum, I'd like MarkdownExtra to detect regular expression errors such as this, and throw them as exceptions, so at least we can know what went wrong instead of the code failing at some arbitrary spot. I have an example prototype here demonstrating this which I could open as a PR. I noticed that this library does not throw exceptions at all, so this would be a first; I'm not sure if there is a more preferred way of handling this kind of error, such as maybe returning a blank string, or maybe returning the original text unmodified.

In addition to a change like that (or instead of), I could simply just increase the recursion limit back to PHP's default of 100,000 and observe the performance, as PHP has gotten a lot faster since we originally lowered the setting.

However, I also wonder if some algorithm improvement could be made in Markdown to be able to detect dangling < such as these which are clearly not HTML tags, so it doesn't end up exhausting the recursion depth. I think even with the default of 100,000 people will start running into this issue more with PHP 7.3+ for especially large files.

Would love your thoughts on this! Thanks.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions