Proposal: Add final class Vector to PHP #7488

Closed
wants to merge 7 commits into from

TysonAndre
Contributor

@TysonAndre TysonAndre commented Sep 12, 2021

See spl_vector.stub.php for the userland API.

RFC: https://wiki.php.net/rfc/vector
Discussion: https://externals.io/message/116048


Planned changes:

  • Make indexOf ?int instead of int|false
  • Add optional padding value to setSize to allow extending arrays with values other than null, like C++/Rust
  • Add setCapacity/reserve method, ignore or adjust requested capacities that are too small https://externals.io/message/116048#116074
  • Add method documentation and throws documentation to RFC
  • Add example uses to RFC
  • Add isEmpty
  • Add map/filter
  • Make push variadic, like array_push

Earlier work on the implementation can be found at
https://github.com/TysonAndre/pecl-teds
and it can be tested out with https://pecl.php.net/package/teds as Teds\Vector
(almost the same apart from the name and \Vector reusing spl's conversion of mixed to int $offset)

This was originally based on spl_fixedarray.c and previous work I did on an RFC.

Notable features of Vector:

  • Roughly half the memory usage of (non-constant) arrays due to
    not needing the hash or key that's maintained even when not needed in array entries
    (https://www.npopov.com/2014/12/22/PHPs-new-hashtable-implementation.html#packed-hashtables)
    and not needing a minimum size of 8 or powers of 2 (technically avoidable in array but not done so far due to tradeoffs).

    This may be useful in applications that need to load a lot of data,
    or in reducing the memory usage of long-running php processes.

  • Same memory usage as SplFixedArray in 8.2 for a given capacity

  • Lower memory usage and better performance than SplDoublyLinkedList
    or its subclasses (SplStack) due to being internally represented as a
    C array instead of a linked list

  • More efficient resizing than SplFixedArray's setSize(getSize+1)

  • Support $vector[] = $element, like ArrayObject.

  • Allow enforcing that a list of values really is a list without gaps or string keys

  • Having this functionality in php itself rather than a third party extension would encourage wider adoption of this
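
To illustrate the "really is a list" point above: a plain array can silently stop being a list once a gap or a string key appears. This is a minimal sketch using only built-in functions (`array_is_list()` requires PHP 8.1+); the variable names are illustrative.

```php
<?php
// A plain array starts out as a list (keys 0, 1, 2)...
$values = ['a', 'b', 'c'];
$isListBefore = array_is_list($values);

// ...but removing a middle element leaves a gap (keys are now 0 and 2),
unset($values[1]);
$isListAfterUnset = array_is_list($values);

// ...and adding a string key turns the array into a map.
$mixed = ['a', 'b'];
$mixed['name'] = 'c';
$isListAfterStringKey = array_is_list($mixed);

var_dump($isListBefore, $isListAfterUnset, $isListAfterStringKey);
```

A type that rejects both operations would let callers rely on keys 0..count-1 without re-checking.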

Backwards incompatible changes:

  • Userland classlikes named \Vector in the global namespace would cause a compile error due to this class
    now being declared internally.
/**
 * A Vector is a container of a sequence of values (with keys 0, 1, ...count($vector) - 1)
 * that can change in size.
 *
 * This is backed by a memory-efficient representation
 * (raw array of values) and provides fast access (compared to other objects in the Spl)
 * and constant amortized-time push/pop operations.
 *
 * Attempting to read or write values outside of the range of values with `*get`/`*set` methods will throw at runtime.
 */
final class Vector implements IteratorAggregate, Countable, JsonSerializable, ArrayAccess
{
    /**
     * Construct a Vector from an iterable.
     *
     * The keys will be ignored, and values will be reindexed without gaps starting from 0
     */
    public function __construct(iterable $iterator = []) {}
    /**
     * Returns an iterator that will return the indexes and values of this Vector until index >= count()
     */
    public function getIterator(): InternalIterator {}
    /**
     * Returns the number of values in this Vector
     */
    public function count(): int {}
    /**
     * Reduces the Vector's capacity to its size, freeing any extra unused memory.
     */
    public function shrinkToFit(): void {}
    /**
     * Remove all elements from the Vector and free all reserved capacity.
     */
    public function clear(): void {}

    /**
     * If $size is greater than the current size, raise the size and fill the free space with $value
     * If $size is less than the current size, reduce the size and discard elements.
     */
    public function setSize(int $size, mixed $value = null): void {}

    public function __serialize(): array {}
    public function __unserialize(array $data): void {}
    public static function __set_state(array $array): Vector {}

    public function push(mixed ...$values): void {}
    public function pop(): mixed {}

    public function toArray(): array {}
    // Strictly typed, unlike offsetGet/offsetSet
    public function get(int $offset): mixed {}
    public function set(int $offset, mixed $value): void {}

    /**
     * Returns the value at (int)$offset.
     * @throws OutOfBoundsException if the value of (int)$offset is not within the bounds of this vector
     */
    public function offsetGet(mixed $offset): mixed {}

    /**
     * Returns true if `0 <= (int)$offset && (int)$offset < $this->count()`.
     */
    public function offsetExists(mixed $offset): bool {}

    /**
     * Sets the value at offset (int)$offset to $value
     * @throws \OutOfBoundsException if the value of (int)$offset is not within the bounds of this vector
     */
    public function offsetSet(mixed $offset, mixed $value): void {}

    /**
     * @throws \RuntimeException unconditionally because unset and null are different things, unlike SplFixedArray
     */
    public function offsetUnset(mixed $offset): void {}

    /**
     * Returns the offset of a value that is === $value, or returns null.
     */
    public function indexOf(mixed $value): ?int {}
    /**
     * @return bool true if there exists a value === $value in this vector.
     */
    public function contains(mixed $value): bool {}

    /**
     * Returns a new Vector instance created from the return values of $callback($element)
     * being applied to each element of this vector.
     *
     * (at)param callable(mixed):mixed $callback
     */
    public function map(callable $callback): Vector {}
    /**
     * Returns the subset of elements of the Vector satisfying the predicate.
     *
     * If the value returned by the callback is truthy
     * (e.g. true, non-zero number, non-empty array, truthy object, etc.),
     * this is treated as satisfying the predicate.
     *
     * (at)param null|callable(mixed):bool $callback
     */
    public function filter(?callable $callback = null): Vector {}

    public function jsonSerialize(): array {}
}
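
The array-access semantics documented in the stub can be approximated in userland. The sketch below is a hypothetical class (`ListSketch` is not part of the PR) backed by a plain PHP array, so it has none of the memory benefits of the proposed C implementation; it only mirrors the documented behavior: keys are dropped on construction, out-of-range access throws, `unset` always throws, and `indexOf` returns `?int`.

```php
<?php
// Hypothetical userland approximation; NOT the proposed C implementation.
final class ListSketch implements ArrayAccess, Countable
{
    /** @var list<mixed> */
    private array $values = [];

    public function __construct(iterable $iterator = [])
    {
        // Keys are ignored; values are reindexed from 0 without gaps.
        foreach ($iterator as $value) {
            $this->values[] = $value;
        }
    }

    public function count(): int
    {
        return count($this->values);
    }

    public function offsetExists(mixed $offset): bool
    {
        $offset = (int) $offset;
        return 0 <= $offset && $offset < count($this->values);
    }

    public function offsetGet(mixed $offset): mixed
    {
        $offset = $this->checkedOffset($offset);
        return $this->values[$offset];
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        if ($offset === null) {
            // $list[] = $value appends to the end.
            $this->values[] = $value;
            return;
        }
        $offset = $this->checkedOffset($offset);
        $this->values[$offset] = $value;
    }

    public function offsetUnset(mixed $offset): void
    {
        // Unsetting would leave a gap, so it is rejected unconditionally.
        throw new RuntimeException('unset is not supported');
    }

    public function indexOf(mixed $value): ?int
    {
        // Strict (===) search; null rather than false when nothing matches.
        $key = array_search($value, $this->values, true);
        return $key === false ? null : $key;
    }

    private function checkedOffset(mixed $offset): int
    {
        $offset = (int) $offset;
        if ($offset < 0 || $offset >= count($this->values)) {
            throw new OutOfBoundsException("Index $offset is out of range");
        }
        return $offset;
    }
}

$list = new ListSketch(['x' => 'a', 'y' => 'b']); // string keys are discarded
$list[] = 'c';
```

Reading `$list[3]` or calling `unset($list[0])` on this sketch throws instead of returning null or creating a gap.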

Benchmark

This is a contrived benchmark for estimating performance of building/reading variable-sized arrays of different sizes when the final size would be unknown (it is known here).

Notably, Vector is more memory and/or time efficient than the other object data structures. There are also times when you may prefer to pass around an object collection rather than an array: for example, when collecting the values of an unbalanced binary tree, or when letting the caller modify a passed-in collection while ensuring that it is not changed to a different type by reference and that no gaps are introduced at runtime.

Read time is counted separately from create+destroy time. This is a total over all iterations, and the instrumentation adds to the time needed.

SplFixedArray doesn't have a push() method, and this benchmark would be faster if it did.
SplStack uses foreach for read benchmarking because the random access of SplStack is O(n) (linear time) in a linked list.
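
For comparison, a `push()` on top of SplFixedArray could grow the backing storage geometrically instead of calling `setSize(getSize() + 1)` on every append. The helper below is a hypothetical sketch (`fixed_array_push` and the doubling factor are assumptions chosen for illustration; the PR text does not state which growth factor Vector uses). The caller tracks the logical size separately, and `getSize()` plays the role of capacity.

```php
<?php
// Hypothetical helper: append to an SplFixedArray with geometric growth.
// $size is the number of slots in use; $values->getSize() is the capacity.
function fixed_array_push(SplFixedArray $values, int &$size, mixed $value): void
{
    $capacity = $values->getSize();
    if ($size === $capacity) {
        // Double the capacity (starting at 4) so that resizes stay rare.
        $values->setSize($capacity > 0 ? $capacity * 2 : 4);
    }
    $values[$size] = $value;
    $size++;
}

$values = new SplFixedArray();
$size = 0;
for ($i = 0; $i < 1000; $i++) {
    fixed_array_push($values, $size, $i);
}
$capacityBeforeTrim = $values->getSize();

// Comparable in spirit to shrinkToFit(): drop the unused capacity.
$values->setSize($size);
```

With doubling, 1000 appends trigger only a handful of resizes rather than 1000, which is where the amortized constant-time push comes from.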

All of the data structures in this benchmark make efficient use of memory for powers of 2, though array has a minimum capacity of 8.

NOTE: At the moment, array is faster in most of these measurements, at the cost of higher memory usage.

EDIT: benchmarks for the array type will need to be redone if other proposed optimizations for array memory usage are approved for 8.2 and merged without issues

<?php

function bench_array(int $n, int $iterations) {
    $totalReadTime = 0.0;
    $startTime = hrtime(true);
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = [];
        for ($i = 0; $i < $n; $i++) {
            $values[] = $i;
        }
        $startReadTime = hrtime(true);
        for ($i = 0; $i < $n; $i++) {
            $total += $values[$i];
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;

        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);

    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to array:         n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
function bench_vector(int $n, int $iterations) {
    $startTime = hrtime(true);
    $totalReadTime = 0.0;
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = new Vector();
        for ($i = 0; $i < $n; $i++) {
            $values[] = $i;
        }

        $startReadTime = hrtime(true);
        for ($i = 0; $i < $n; $i++) {
            $total += $values[$i];
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;

        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);
    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to Vector:        n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
// SplStack is a subclass of SplDoublyLinkedList, so it is a linked list that takes more memory than needed.
// Access to values in the middle of the SplStack is also less efficient.
function bench_spl_stack(int $n, int $iterations) {
    $startTime = hrtime(true);
    $totalReadTime = 0.0;
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = new SplStack();
        for ($i = 0; $i < $n; $i++) {
            $values->push($i);
        }
        $startReadTime = hrtime(true);
        // Random access is linear time in a linked list, so use foreach instead
        foreach ($values as $value) {
            $total += $value;
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;
        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);
    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to SplStack:      n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
function bench_spl_fixed_array(int $n, int $iterations) {
    $startTime = hrtime(true);
    $totalReadTime = 0.0;
    $total = 0;
    for ($j = 0; $j < $iterations; $j++) {
        $startMemory = memory_get_usage();
        $values = new SplFixedArray();
        for ($i = 0; $i < $n; $i++) {
            // Imitate how push() would be implemented in a situation
            // where the number of elements wasn't actually known ahead of time.
            // erealloc() tends to extend the existing array when possible.
            $size = $values->getSize();
            $values->setSize($size + 1);
            $values->offsetSet($size, $i);
        }
        $startReadTime = hrtime(true);
        for ($i = 0; $i < $n; $i++) {
            $total += $values[$i];
        }
        $endReadTime = hrtime(true);
        $totalReadTime += $endReadTime - $startReadTime;
        $endMemory = memory_get_usage();
        unset($values);
    }
    $endTime = hrtime(true);
    $totalTime = ($endTime - $startTime) / 1000000000;
    $totalReadTimeSeconds = $totalReadTime / 1000000000;
    printf("Appending to SplFixedArray: n=%8d iterations=%8d memory=%8d bytes, create+destroy time=%.3f read time = %.3f result=%d\n\n",
        $n, $iterations, $endMemory - $startMemory, $totalTime - $totalReadTimeSeconds, $totalReadTimeSeconds, $total);
}
// Each entry is [$n, $iterations].
$sizes = [
    [1, 8000000],
    [4, 2000000],
    [8, 1000000],
    [2**20, 20],
];
printf(
    "Results for php %s debug=%s with opcache enabled=%s\n\n",
    PHP_VERSION,
    PHP_DEBUG ? 'true' : 'false',
    json_encode(function_exists('opcache_get_status') && (opcache_get_status(false)['opcache_enabled'] ?? false))
);

foreach ($sizes as [$n, $iterations]) {
    bench_array($n, $iterations);
    bench_vector($n, $iterations);
    bench_spl_stack($n, $iterations);
    bench_spl_fixed_array($n, $iterations);
    echo "\n";
}

Results for php 8.2.0-dev debug=false with opcache enabled=true

Appending to array:         n=       1 iterations= 8000000 memory=     376 bytes, create+destroy time=0.645 read time = 0.308 result=0
Appending to Vector:        n=       1 iterations= 8000000 memory=     128 bytes, create+destroy time=1.003 read time = 0.355 result=0
Appending to SplStack:      n=       1 iterations= 8000000 memory=     184 bytes, create+destroy time=1.737 read time = 0.742 result=0
Appending to SplFixedArray: n=       1 iterations= 8000000 memory=      80 bytes, create+destroy time=1.810 read time = 0.428 result=0


Appending to array:         n=       4 iterations= 2000000 memory=     376 bytes, create+destroy time=0.222 read time = 0.114 result=12000000
Appending to Vector:        n=       4 iterations= 2000000 memory=     128 bytes, create+destroy time=0.323 read time = 0.164 result=12000000
Appending to SplStack:      n=       4 iterations= 2000000 memory=     280 bytes, create+destroy time=0.739 read time = 0.301 result=12000000
Appending to SplFixedArray: n=       4 iterations= 2000000 memory=     128 bytes, create+destroy time=1.164 read time = 0.233 result=12000000


Appending to array:         n=       8 iterations= 1000000 memory=     376 bytes, create+destroy time=0.154 read time = 0.084 result=28000000
Appending to Vector:        n=       8 iterations= 1000000 memory=     192 bytes, create+destroy time=0.227 read time = 0.148 result=28000000
Appending to SplStack:      n=       8 iterations= 1000000 memory=     408 bytes, create+destroy time=0.530 read time = 0.240 result=28000000
Appending to SplFixedArray: n=       8 iterations= 1000000 memory=     192 bytes, create+destroy time=1.026 read time = 0.205 result=28000000


Appending to array:         n= 1048576 iterations=      20 memory=33558608 bytes, create+destroy time=0.699 read time = 0.151 result=10995105792000
Appending to Vector:        n= 1048576 iterations=      20 memory=16777304 bytes, create+destroy time=0.483 read time = 0.271 result=10995105792000
Appending to SplStack:      n= 1048576 iterations=      20 memory=33554584 bytes, create+destroy time=0.865 read time = 0.410 result=10995105792000
Appending to SplFixedArray: n= 1048576 iterations=      20 memory=16777304 bytes, create+destroy time=2.431 read time = 0.404 result=10995105792000

@mvorisek
Contributor

As this type/class is proposed as final, is there any strong reason for an extra type/class rather than implementing this as an optimization of the existing array type?

This way, developers would not have to care, a lot of existing apps will be faster, will use less memory, ... I see only positives.

@TysonAndre
Contributor Author

TysonAndre commented Sep 12, 2021

As this type/class is proposed as final, is there any strong reason for an extra type/class rather than implementing this as an optimization of the existing array type?

Adding a new type to php as a non-class is a massive undertaking for php-src itself and extension authors. It would not work with a lot of existing code that handles arrays and objects - I expect that is_vec would be a separate check from is_object and is_array, etc. This is part of why PHP 8.1 enum classes are an object type rather than a distinct type.

See https://www.npopov.com/2015/05/05/Internal-value-representation-in-PHP-7-part-1.html for how php represents values internally.

  • That would also require a lot more familiarity than I have with opcache and the JIT assembly compiler, and (I expect it would) be more controversial due to not working with existing code

Additionally, adding a class doesn't prevent adding a vec/list in the future - for example, HHVM has both vec and https://docs.hhvm.com/hack/reference/class/HH.Vector/ , PHP has both array and ArrayObject, etc.

Also, even if that were done, vec and array would be distinct types - a vec couldn't be passed to a parameter that expected an array reference (or returned in a return value), because later adding a string array key (in the parameter or return value) would be a runtime error.
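
A small illustration of the by-reference problem described above, using only plain arrays. The function name `add_label` is hypothetical; it stands in for any existing code written against `array`.

```php
<?php
// Hypothetical function written against plain arrays.
function add_label(array &$row): void
{
    // Legal for array, but impossible for a type that only allows keys 0..n-1.
    $row['label'] = 'total';
}

$row = [1, 2, 3];
$wasList = array_is_list($row);
add_label($row);
$isListAfterCall = array_is_list($row);

var_dump($wasList, $isListAfterCall);
```

If `$row` were a hypothetical `vec`, the assignment inside `add_label()` would have to fail at runtime, which is why such a type could not simply be passed wherever an array reference is expected.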

@krakjoe krakjoe added the RFC label Sep 13, 2021
}

return NULL;
return zend_interned_string_ht_lookup_ex(ZSTR_H(str), ZSTR_VAL(str), ZSTR_LEN(str), interned_strings);
Contributor Author

@TysonAndre TysonAndre Sep 13, 2021

@nikic FYI - I wonder if you're able to reproduce the test failure in the prior commit (in gcc 9.3.0-17ubuntu1~20.04) for make test TESTS='ext/spl/tests/Vector/aggregate.phpt -m --show-mem --show-diff' (it segfaults in idx = Z_NEXT(p->val); when compiling the code and interning the strings, not when running it). It seems almost definitely related to running the code under valgrind with -O2 [-g]. Even after git clean -fdx and recompiling, it's still an issue.

  • Oddly, running the valgrind tests in docker on other OSes (e.g. dockerhub php:8.1.0RC1 with gcc (Debian 10.2.1-6) 10.2.1 20210110) doesn't have this issue.
  • When I add fprintf(stderr, ...) statements, the issue goes away

(The segfault doesn't happen with gdb, or even gdb with USE_ZEND_ALLOC=0 - it's baffling.)

  • If you are able to reproduce this, it's probably a symptom of a larger issue that may affect other php 8.1/8.2 users using the same compiler
  • If you aren't, it still may be a good idea to avoid (1) repeating code and (2) redundant loads of ZSTR_LEN(str) from memory into registers if gcc can't infer that the assembly function zend_string_equal_val doesn't modify memory in the assembly block, and needs to scan over multiple buckets (probably not redundant if the hash is the same, hash collision chance is extremely low)

Options used to build in linux mint 20.2: export CFLAGS='-O2 -g'; ./buildconf --force; ./configure --disable-all --with-zlib --with-zip --enable-phar --with-bz2 --enable-opcache --with-readline --enable-tokenizer --prefix=$HOME/php-8.2-unoptimized-nts-vector-install --with-valgrind; make -j8

See spl_vector.stub.php for the userland API.

Earlier work on the implementation can be found at
https://github.com/TysonAndre/pecl-teds
and it can be tested out at https://pecl.php.net/package/teds as `Teds\Vector`
(currently the same apart from reusing spl's conversion of mixed to `int $offset`)

This was originally based on spl_fixedarray.c and previous work I did on an RFC.

Notable features:
- Roughly half the memory usage of (non-constant) arrays
  due to not needing a list of indexes of buckets separately
  from the zval buckets themselves.
- Same memory usage as SplFixedArray in 8.2 for a given **capacity**
- Lower memory usage and better performance than SplDoublyLinkedList
  or its subclasses (SplStack) due to being an array instead of a linked list
- More efficient resizing than SplFixedArray's setSize(getSize+1)

```php
final class Vector implements IteratorAggregate, Countable, JsonSerializable, ArrayAccess
{
	public function __construct(
		iterable $iterator = [],
		bool $preserveKeys = true
	) {}
	public function getIterator(): InternalIterator {}
	public function count(): int {}
	public function capacity(): int {}
	public function clear(): void {}
	public function setSize(int $size): void {}

	public function __serialize(): array {}
	public function __unserialize(array $data): void {}
	public static function __set_state(array $array): Vector {}

	public function push(mixed $value): void {}
	public function pop(): mixed {}

	public function toArray(): array {}
	// Strictly typed, unlike offsetGet/offsetSet
	public function valueAt(int $offset): mixed {}
	public function setValueAt(int $offset, mixed $value): void {}

	public function offsetGet(mixed $offset): mixed {}
	public function offsetExists(mixed $offset): bool {}
	public function offsetSet(mixed $offset, mixed $value): void {}
	// Throws because unset and null are different things.
	public function offsetUnset(mixed $offset): void {}

	public function indexOf(mixed $value): int|false {}
	public function contains(mixed $value): bool {}

	public function shrinkToFit(): void {}

	public function jsonSerialize(): array {}
}
```
I'm guessing this is a bug seen when inlining and one function using assembly
and the other not using assembly related to `zend_string_equal_content`?
At lower optimization levels it passes.
(Environment: gcc 9.3 on Linux)

`make test TESTS='ext/spl/tests/Vector/aggregate.phpt -m --show-mem --show-diff'`
fails while compiling the variable in the foreach.
@runner78

It would be suitable to use a different name for \Vector; the name is basically a misnomer in C++ as well.

Quote from stackoverflow:
"Alex Stepanov, the designer of the Standard Template Library, was looking for a name to distinguish it from built-in arrays. He admits now that he made a mistake, because mathematics already uses the term 'vector' for a fixed-length sequence of numbers. C++11 compounds this mistake by introducing a class 'array' that behaves similarly to a mathematical vector."

@iluuu1994
Member

@TysonAndre Is there any progress on this RFC?

@TysonAndre
Contributor Author

Note that starting in php 8.2, arrays that are lists (with no/few gaps) are represented in a more memory efficient way than associative arrays.

Closing this in favor of the Deque rfc

  • If the Deque rfc passes, there's less of a need for Vector - they have similar performance (with Vector, like array, being worse for shift/unshift), and the overhead of Deque memory offset calculation is negligible. The capacity of a Deque (and of PHP arrays that aren't constants in opcache) must be a power of 2
  • If the Deque rfc doesn't pass, I wouldn't expect Vector to pass
  • PHP 8.2 reduces the memory usage of arrays that are used like lists and have a packed representation
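
One way to see why a power-of-2 capacity keeps the offset calculation cheap is a ring buffer, where the physical slot is found with a single bit mask. The class below is a hypothetical fixed-capacity sketch (`RingSketch` is not taken from the Deque RFC, and the assumption that Deque uses this layout is mine); it only demonstrates the masking arithmetic.

```php
<?php
// Hypothetical ring-buffer sketch; not taken from the Deque RFC's source.
final class RingSketch
{
    private array $slots;
    private int $head = 0;   // physical index of logical element 0
    private int $size = 0;
    private int $mask;

    public function __construct(int $capacity = 8)
    {
        // The mask trick only works when the capacity is a power of 2.
        if ($capacity < 1 || ($capacity & ($capacity - 1)) !== 0) {
            throw new InvalidArgumentException('capacity must be a power of 2');
        }
        $this->slots = array_fill(0, $capacity, null);
        $this->mask = $capacity - 1;
    }

    public function push(mixed $value): void
    {
        if ($this->size > $this->mask) {
            throw new OverflowException('full (growth is omitted for brevity)');
        }
        $this->slots[($this->head + $this->size) & $this->mask] = $value;
        $this->size++;
    }

    public function shift(): mixed
    {
        if ($this->size === 0) {
            throw new UnderflowException('empty');
        }
        $value = $this->slots[$this->head];
        $this->slots[$this->head] = null;
        // Advancing head is O(1); no elements are moved.
        $this->head = ($this->head + 1) & $this->mask;
        $this->size--;
        return $value;
    }

    public function get(int $offset): mixed
    {
        if ($offset < 0 || $offset >= $this->size) {
            throw new OutOfBoundsException("Index $offset is out of range");
        }
        return $this->slots[($this->head + $offset) & $this->mask];
    }
}

$ring = new RingSketch(4);
foreach ([1, 2, 3, 4] as $v) {
    $ring->push($v);
}
$first = $ring->shift();   // removes the front element without moving the rest
$ring->push(5);            // wraps around into the freed slot
```

By contrast, removing the front of a contiguous vector or a packed array requires shifting every remaining element down by one.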

@TysonAndre TysonAndre closed this Oct 11, 2022
5 participants