
Reduce memory used by token_get_all() #4753


Closed · wants to merge 1 commit

Conversation

TysonAndre
Contributor

Around a quarter of all tokens in files would have a string that's one
character long (e.g. ` `, `\`, `1`).

For parsing a large number of PHP files,
the memory increase dropped from 378374248 to 369535688 bytes (2.5%) (other projects probably have fewer single-byte tokens).

Script used:

<?php
// Pass the php files to tokenize as CLI args
function main(array $contents) {
    $tokens = [];
    $start = microtime(true);
    $start_mem_virtual = memory_get_usage();
    foreach ($contents as $file_contents) {
        $tokens[] = token_get_all($file_contents);
    }
    $end_mem_virtual = memory_get_usage();
    if (false) { // set to true to report single-byte token counts
        $counts = [0,0];
        foreach (array_merge(...$tokens) as $token) {
            if (is_array($token)) {
                $counts[strlen($token[1]) == 1 ? 1 : 0]++;
            }
        }
        // [not_single_byte_count, single_byte_count], e.g. [1429745,628512]
        echo json_encode($counts) . "\n";
    }
    unset($tokens);
    // Count the time needed to create and free tokens.
    $end = microtime(true);
    printf("Increase in memory(virtual): %d bytes\n", $end_mem_virtual - $start_mem_virtual);
    printf("Time taken to call token_get_all: %.6f seconds\n", $end - $start);
}
$contents = [];
foreach (array_slice($argv, 1) as $file) {
    $contents[] = file_get_contents($file);
}
for ($i = 0; $i < 10; $i++) {
    main($contents);
}

@nikic
Member

nikic commented Sep 28, 2019

It may be worthwhile to locally intern strings used during one token_get_all run, because it's likely that quite a few strings are going to appear many times. This will have some performance cost, as we'll have to look up the string in a hashtable, but it may be worthwhile.

@nikic
Member

nikic commented Sep 28, 2019

Though probably string size just isn't that much of a factor right now anyway, because memory usage is dominated by the array size. We should really have a mode that returns objects instead, which use much less memory (#2430).

@TysonAndre
Contributor Author

I think adding it to php-ast as ast\Token might make it immediately available on all PHP versions to users/apps focused on performance, but that's less broadly accessible long-term.

@TysonAndre
Contributor Author

It may be worthwhile to locally intern strings used during one token_get_all run, because it's likely that quite a few strings are going to appear many times. This will have some performance cost, as we'll have to look up the string in a hashtable, but it may be worthwhile.

I'd expect the performance cost to be noticeable. (Off-topic: if an application needed to reduce memory, it could use igbinary_unserialize(igbinary_serialize(token_get_all(...))))

Igbinary has a data structure similar to what you would need for that. In igbinary, deduplicating all strings (in addition to strings used as class names) caused a noticeable hit to the time taken to generate a result (i.e. serialized data). This was acceptable because unserialization is more common and reducing storage is also important.

https://github.com/igbinary/igbinary

Unserialization performance is at least on par with the standard PHP serializer, and is much faster for repetitive data. Serialization performance depends on the igbinary.compact_strings option, which enables duplicate string tracking. Strings are inserted into a hash table, which adds some overhead when serializing. In usual scenarios this does not have much of an impact, because the typical usage pattern is "serialize rarely, unserialize often". With the compact_strings option enabled, igbinary is usually a bit slower than the standard serializer. Without it, igbinary is a bit faster.

I haven't profiled it with a C profiler, but my best guesses are:

  1. The cache entries (or objects they use) might not be in L1/L2 cache if the number of strings is large
  2. Branch mispredictions when traversing the cache implementation looking for matches.

I tried to optimize the original implementation in https://github.com/igbinary/igbinary/blob/9e7e8d6/src/php7/hash_si.c#L78-L130 (haven't tried other approaches such as cuckoo hashing, or existing hash sets, and would be interested if anyone knew better options).

@php-pulls php-pulls closed this in 7e0f7b7 Sep 28, 2019
@nikic
Member

nikic commented Sep 28, 2019

Another possibility would be to have a per-token cache of a single string. For a lot of tokens there is only a single possible value (like "->" or "::"), while for some others one value is very likely (like "class", though "CLASS" etc. can theoretically also occur). That would be a cheaper alternative to full interning.

(Though again, I'm not sure if this is really worthwhile without objects.)

@TysonAndre
Contributor Author

On the note of reducing memory for preg_match, token_get_all, etc.

Have there been any proposals for a built-in tuple type (i.e. a fixed-length array), like in Python (fixed-length, optionally immutable at the top level)? Flags such as TOKEN_AS_TUPLE, PREG_MATCH_AS_TUPLE, etc. could be used.

This is along the lines of what php-ds has planned for 2.0, but I haven't looked at their implementation. https://github.com/php-ds/ext-ds/blob/2.0/DRAFT.php#L521-L535

@nikic
Member

nikic commented Sep 29, 2019

While we're speculating... we could also optimize the array implementation for the tuple case. In particular, when packed arrays are used, there is no strong reason to require a power of two array size. We could start off with an exact number of elements, but would then have to perform a pow2 check on resize. As resize is relatively rare, it may be worthwhile. For this case (array with three elements) we'd save 32*5=160 bytes out of 32*8+56=312. While not as good as an object, this would be a big saving, and would also benefit other places.

@TysonAndre
Contributor Author

While we're speculating... we could also optimize the array implementation for the tuple case. In particular, when packed arrays are used, there is no strong reason to require a power of two array size. We could start off with an exact number of elements, but would then have to perform a pow2 check on resize.

For this case (array with three elements) we'd save 32*5=160 bytes out of 32*8+56=312

Oh, good. So the minimum capacity of 8 was there to optimize hash tables, not regular packed arrays. I wasn't sure whether it was also needed for packed arrays (e.g. if benchmarking had found that reallocs were slower than using extra memory).

@TysonAndre
Contributor Author

TysonAndre commented Sep 29, 2019

While we're speculating... we could also optimize the array implementation for the tuple case. In particular, when packed arrays are used, there is no strong reason to require a power of two array size. We could start off with an exact number of elements,

@nikic - I created TysonAndre#5 to test if this is viable, whether it'd cause problems for extensions, etc.

(This currently still rounds up to the nearest power of 2, but decreases the minimum size for non-packed arrays from 8 to 2. I haven't gotten around to removing the rounding.)

TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Oct 5, 2019
Continue to require that associative arrays have a minimum
size of 8 and are powers of 2.

Stop requiring that packed arrays have a size that is a power of 2.
There's no need for that with packed arrays,
since PHP isn't doing any hash lookups with bit operations on them.

Add a helper to create an associative array with a minimum size.
This may help with clarity/avoiding unnecessary size recalculations.
(performance impact may be negligible)

https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html
gives some background information on the `HashTable` implementation on
master.

This is a followup to
php#4753 (comment)
(I didn't see any other PRs mentioning HT_MIN_SIZE)
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Oct 5, 2019
Continue to require that associative arrays have a minimum
size of 8 and are powers of 2.

This can reduce memory usage by ~16%
in applications that heavily use small arrays.

Stop requiring that packed arrays have a size that is a power of 2.
There's no need for that with packed arrays,
since PHP isn't doing any hash lookups with bit operations on them.

Add a helper to create an associative array with a minimum size.
This may help with clarity/avoiding unnecessary size recalculations.
(performance impact may be negligible)

https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html
gives some background information on the `HashTable` implementation on
master.

This is a followup to
php#4753 (comment)
(I didn't see any other PRs mentioning HT_MIN_SIZE)