-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Proposal: Add final class Vector
to PHP
#7488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
As this type/class is proposed as final, is there any strong reason for an extra type/class than implementing this as an optimization of existing This way, developer would not have to care, a lot of exisitng apps will be faster, will use less memory, ... I see only positives. |
Adding a new type to php as a non-class is a massive undertaking for php-src itself and extension authors. It would not work with a lot of existing code that handled arrays and objects - I expect that See https://www.npopov.com/2015/05/05/Internal-value-representation-in-PHP-7-part-1.html for how php represents values internally.
Additionally, adding a class doesn't prevent adding a Also, even if that were done, |
} | ||
|
||
return NULL; | ||
return zend_interned_string_ht_lookup_ex(ZSTR_H(str), ZSTR_VAL(str), ZSTR_LEN(str), interned_strings); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nikic FYI - I wonder if you're able to reproduce the test failure in the prior commit (in gcc 9.3.0-17ubuntu1~20.04
) for make test TESTS='ext/spl/tests/Vector/aggregate.phpt -m --show-mem --show-diff
(it segfaults in idx = Z_NEXT(p->val);
when compiling the code and interning the strings, not when running it). It seems almost definitely related to running the code under valgrind with -O2 [-g]
. Even after git clean -fdx
and recompiling it's still an issue
- Oddly, running valgrind tests in docker in other OSes (e.g. dockerhub php:8.1.0RC1 with
gcc (Debian 10.2.1-6) 10.2.1 20210110
) don't have this issue. - When I add
fprintf(stderr
statements, the issue goes away
(The segfault doesn't happen with gdb, or even gdb with USE_ZEND_ALLOC=0
- it's baffling.)
- If you are able to reproduce this, it's probably a symptom of a larger issue that may affect other php 8.1/8.2 users using the same compiler
- If you aren't, it still may be a good idea to avoid (1) repeating code and (2)
redundant loads of ZSTR_LEN(str) from memory into registers if gcc can't infer that the assembly function zend_string_equal_val doesn't modify memory in the assembly block, and needs to scan over multiple buckets(probably not redundant if the hash is the same, hash collision chance is extremely low)
Options used to build in linux mint 20.2: export CFLAGS='-O2 -g'; ./buildconf --force; ./configure --disable-all --with-zlib --with-zip --enable-phar --with-bz2 --enable-opcache --with-readline --enable-tokenizer --prefix=$HOME/php-8.2-unoptimized-nts-vector-install --with-valgrind; make -j8
See spl_vector.stub.php for the userland API. Earlier work on the implementation can be found at https://github.com/TysonAndre/pecl-teds and it can be tested out at https://pecl.php.net/package/teds as `Teds\Vector` (currently the same apart from reusing spl's conversion of mixed to `int $offset`) This was originally based on spl_fixedarray.c and previous work I did on an RFC. Notable features: - Roughly half the memory usage of (non-constant) arrays due to not needing a list of indexes of buckets separately from the zval buckets themselves. - Same memory usage as SplFixedArray in 8.2 for a given **capacity** - Lower memory usage and better performance than SplDoublyLinkedList or its subclasses (SplStack) due to being an array instead of a linked list - More efficient resizing than SplFixedArray's setSize(getSize+1) ```php final class Vector implements IteratorAggregate, Countable, JsonSerializable, ArrayAccess { public function __construct( iterable $iterator = [], bool $preserveKeys = true ) {} public function getIterator(): InternalIterator {} public function count(): int {} public function capacity(): int {} public function clear(): void {} public function setSize(int $size): void {} public function __serialize(): array {} public function __unserialize(array $data): void {} public static function __set_state(array $array): Vector {} public function push(mixed $value): void {} public function pop(): mixed {} public function toArray(): array {} // Strictly typed, unlike offsetGet/offsetSet public function valueAt(int $offset): mixed {} public function setValueAt(int $offset, mixed $value): void {} public function offsetGet(mixed $offset): mixed {} public function offsetExists(mixed $offset): bool {} public function offsetSet(mixed $offset, mixed $value): void {} // Throws because unset and null are different things. public function offsetUnset(mixed $offset): void {} public function indexOf(mixed $value): int|false {} public function contains(mixed $value): bool {} public function shrinkToFit(): void {} public function jsonSerialize(): array {} } ```
I'm guessing this is a bug seen when inlining and one function using assembly and the other not using assembly related to `zend_string_equal_content`? At lower optimization levels it passes. (Environment: gcc 9.3 on Linux) `make test TESTS='ext/spl/tests/Vector/aggregate.phpt -m --show-mem --show-diff` fails while compiling the variable in the foreach.
I would be suitable for using a different name for \Vector, which is basically also wrongly named in C++. Quote from stackoverflow: |
@TysonAndre Is there any progress on this RFC? |
Closing this in favor of the Deque rfc
|
See spl_vector.stub.php for the userland API.
RFC: https://wiki.php.net/rfc/vector
Discussion: https://externals.io/message/116048
Planned changes:
?int
instead ofint|false
Add setCapacity/reserve method, ignore or adjust requested capacities that are too small https://externals.io/message/116048#116074map
/filter
Earlier work on the implementation can be found at
https://github.com/TysonAndre/pecl-teds
and it can be tested out with https://pecl.php.net/package/teds as
Teds\Vector
(almost the same apart from the name and
\Vector
reusing spl's conversion of mixed toint $offset
)This was originally based on spl_fixedarray.c and previous work I did on an RFC.
Notable features of
Vector
:Roughly half the memory usage of (non-constant) arrays due to
not needing the hash or key that's maintained even when not needed in array entries
(https://www.npopov.com/2014/12/22/PHPs-new-hashtable-implementation.html#packed-hashtables)
and not needing a minimum size of 8 or powers of 2 (technically avoidable in
array
but not done so far due to tradeoffs).This may be useful in applications that need to load a lot of data,
or in reducing the memory usage of long-running php processes.
Same memory usage as SplFixedArray in 8.2 for a given capacity
Lower memory usage and better performance than SplDoublyLinkedList
or its subclasses (SplStack) due to being internally represented as a
C array instead of a linked list
More efficient resizing than SplFixedArray's setSize(getSize+1)
Support
$vector[] = $element
, like ArrayObject.Allow enforcing that a list of values really is a list without gaps or string keys
Having this functionality in php itself rather than a third party extension would encourage wider adoption of this
Backwards incompatible changes:
\Vector
in the global namespace would cause a compile error due to this classnow being declared internally.
Benchmark
This is a contrived benchmark for estimating performance of building/reading variable-sized arrays of different sizes when the final size would be unknown (it is known here).
Notably,
Vector
is more memory and/or time efficient than the other object data structures, and there are times when you may prefer to pass an object collection rather than an array around (e.g. collecting the values in an unbalanced binary tree) or to allow the caller to modify a passed in collection while ensuring the collection is not changed to a different type by reference, or that gaps are not introduced at runtime.Read time is counted separately from create+destroy time. This is a total over all iterations, and the instrumentation adds to the time needed.
SplFixedArray doesn't have a
push()
method, and this benchmark would be faster if it did.SplStack uses
foreach
for read benchmarking because the random access of SplStack isO(n)
(linear time) in a linked list.All of the data structures in this benchmark make efficient use of memory for powers of 2, though array has a minimum capacity of 8
NOTE: At the moment,
array
is (at the moment) faster at the cost of memory usageEDIT: benchmarks for the
array
type will need to be redone if other proposed optimizations forarray
memory usage are approved for 8.2 and merged without issues