
Reduce memory used by token_get_all() #4753


Closed · wants to merge 1 commit

Conversation

TysonAndre
Contributor

Around a quarter of all tokens in files would have a string that's one
character long (e.g. ` `, `\`, `1`).

For parsing a large number of PHP files,
the memory increase dropped from 378374248 to 369535688 bytes (2.5%) (other projects probably have fewer single-byte tokens).

Script used:

<?php
// Pass the php files to tokenize as CLI args
function main(array $contents) {
    $tokens = [];
    $start = microtime(true);
    $start_mem_virtual = memory_get_usage();
    foreach ($contents as $file_contents) {
        $tokens[] = token_get_all($file_contents);
    }
    $end_mem_virtual = memory_get_usage();
    if (false) { // set to true to report single-byte token counts
        $counts = [0,0];
        foreach (array_merge(...$tokens) as $token) {
            if (is_array($token)) {
                $counts[strlen($token[1]) == 1 ? 1 : 0]++;
            }
        }
        // [not_single_byte_count, single_byte_count], e.g. [1429745,628512]
        echo json_encode($counts) . "\n";
    }
    unset($tokens);
    // Count the time needed to create and free tokens.
    $end = microtime(true);
    printf("Increase in memory(virtual): %d bytes\n", $end_mem_virtual - $start_mem_virtual);
    printf("Time taken to call token_get_all: %.6f seconds\n", $end - $start);
}
$contents = [];
foreach (array_slice($argv, 1) as $file) {
    $contents[] = file_get_contents($file);
}
for ($i = 0; $i < 10; $i++) {
    main($contents);
}

@nikic
Member

nikic commented Sep 28, 2019

It may be worthwhile to locally intern strings used during one token_get_all run, because it's likely that quite a few strings are going to appear many times. This will have some performance cost, as we'll have to look up the string in a hashtable, but it may be worthwhile.

@nikic
Member

nikic commented Sep 28, 2019

Though probably string size just isn't that much of a factor right now anyway, because memory usage is dominated by the array size. We should really have a mode that returns objects instead, which use much less memory (#2430).

@TysonAndre
Contributor Author

I think adding it to php-ast as ast\Token might make it immediately available on all PHP versions to users/apps focused on performance, but that's less broadly accessible long-term.

@TysonAndre
Contributor Author

It may be worthwhile to locally intern strings used during one token_get_all run, because it's likely that quite a few strings are going to appear many times. This will have some performance cost, as we'll have to look up the string in a hashtable, but it may be worthwhile.

I'd expect the performance cost to be noticeable. (Off-topic: if an application needed to reduce memory, it could use igbinary_unserialize(igbinary_serialize(token_get_all(...))))

Igbinary has a data structure similar to what you would need for that. In igbinary, deduplicating all strings (in addition to strings used as class names) caused a noticeable hit to the time taken to generate a result (i.e. serialized data). This was acceptable because unserialization is more common and reducing storage is also important.

https://github.com/igbinary/igbinary

Unserialization performance is at least on par with the standard PHP serializer, and is much faster for repetitive data. Serialization performance depends on the igbinary.compact_strings option, which enables duplicate string tracking. Strings are inserted into a hash table, which adds some overhead when serializing. In usual scenarios this does not have much of an impact, because the typical usage pattern is "serialize rarely, unserialize often". With the compact_strings option enabled, igbinary is usually a bit slower than the standard serializer. Without it, igbinary is a bit faster.

I haven't profiled it with a C profiler, but my best guesses are:

  1. The cache entries (or objects they use) might not be in L1/L2 cache if the number of strings is large
  2. Branch mispredictions when traversing the cache implementation looking for matches.

I tried to optimize the original implementation in https://github.com/igbinary/igbinary/blob/9e7e8d6/src/php7/hash_si.c#L78-L130 (haven't tried other approaches such as cuckoo hashing, or existing hash sets, and would be interested if anyone knew better options).

@php-pulls php-pulls closed this in 7e0f7b7 Sep 28, 2019
@nikic
Member

nikic commented Sep 28, 2019

Another possibility would be to have a per-token cache of a single string. For a lot of tokens there is only a single possible value (like "->" or "::"), while for some others one value is very likely (like "class", though "CLASS" etc. can theoretically also occur). That would be a cheaper alternative to full interning.

(Though again, I'm not sure if this is really worthwhile without objects.)

@TysonAndre
Contributor Author

On the note of reducing memory for preg_match, token_get_all, etc.

Have there been any proposals for a built-in tuple type (i.e. a fixed-length array), like in Python (fixed-length, optionally immutable at the top level)? Flags such as TOKEN_AS_TUPLE, PREG_MATCH_AS_TUPLE, etc. could be used.

This is along the lines of what php-ds has planned for 2.0, but I haven't looked at their implementation. https://github.com/php-ds/ext-ds/blob/2.0/DRAFT.php#L521-L535

@nikic
Member

nikic commented Sep 29, 2019

While we're speculating... we could also optimize the array implementation for the tuple case. In particular, when packed arrays are used, there is no strong reason to require a power of two array size. We could start off with an exact number of elements, but would then have to perform a pow2 check on resize. As resize is relatively rare, it may be worthwhile. For this case (array with three elements) we'd save 32*5=160 bytes out of 32*8+56=312. While not as good as an object, this would be a big saving, and would also benefit other places.

@TysonAndre
Contributor Author

While we're speculating... we could also optimize the array implementation for the tuple case. In particular, when packed arrays are used, there is no strong reason to require a power of two array size. We could start off with an exact number of elements, but would then have to perform a pow2 check on resize.

For this case (array with three elements) we'd save 32*5=160 bytes out of 32*8+56=312

Oh, good. So the minimum capacity of 8 was there to optimize hash tables, not regular packed arrays. I wasn't sure whether it was also needed for packed arrays (e.g. if benchmarking had found that reallocs were slower than using extra memory).

@TysonAndre
Contributor Author

TysonAndre commented Sep 29, 2019

While we're speculating... we could also optimize the array implementation for the tuple case. In particular, when packed arrays are used, there is no strong reason to require a power of two array size. We could start off with an exact number of elements,

@nikic - I created TysonAndre#5 to test if this is viable, whether it'd cause problems for extensions, etc.

(This currently still rounds up to the nearest power of 2, but decreases the minimum size for non-packed arrays from 8 to 2. I haven't gotten around to removing the rounding.)

TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Oct 5, 2019
Continue to require that associative arrays have a minimum
size of 8 and are powers of 2.

Stop requiring that packed arrays have a size that is a power of 2.
There's no need for that with packed arrays,
since PHP isn't doing any hash lookups with bit operations on them.

Add a helper to create an associative array with a minimum size.
This may help with clarity/avoiding unnecessary size recalculations.
(performance impact may be negligible)

https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html
gives some background information on the `HashTable` implementation on
master.

This is a followup to
php#4753 (comment)
(I didn't see any other PRs mentioning HT_MIN_SIZE)
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Oct 5, 2019
Continue to require that associative arrays have a minimum
size of 8 and are powers of 2.

This can reduce memory usage by ~16%
in applications that heavily use small arrays.

Stop requiring that packed arrays have a size that is a power of 2.
There's no need for that with packed arrays,
since PHP isn't doing any hash lookups with bit operations on them.

Add a helper to create an associative array with a minimum size.
This may help with clarity/avoiding unnecessary size recalculations.
(performance impact may be negligible)

https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html
gives some background information on the `HashTable` implementation on
master.

This is a followup to
php#4753 (comment)
(I didn't see any other PRs mentioning HT_MIN_SIZE)