Skip to content

Fix GH-11573: RecursiveDirectoryIterator::hasChildren is slow #11577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions UPGRADING.INTERNALS
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,9 @@ PHP 8.3 INTERNALS UPGRADE NOTES
- zend_set_user_opcode_handler
- zend_ssa_inference
* Removed unused macros PHP_FNV1_32A_INIT and PHP_FNV1A_64_INIT. See GH-11114.
* _php_stream_dirent now has an extra d_type field that is used to store the
directory entry type. This can be used to avoid additional stat calls for
types when the type is already known.

========================
2. Build system changes
Expand Down Expand Up @@ -115,6 +118,7 @@ PHP 8.3 INTERNALS UPGRADE NOTES
e. ext/spl
- The PHPAPI spl_iterator_apply() function now returns zend_result instead of int.
There are no functional changes.
- The field _spl_filesystem_object->is_recursive has been removed.

f. ext/dom
- A new function dom_get_doc_props_read_only() is added to gather the document
Expand Down
10 changes: 10 additions & 0 deletions Zend/zend_virtual_cwd.h
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,16 @@ typedef unsigned short mode_t;
#else
#ifdef HAVE_DIRENT_H
#include <dirent.h>

#ifndef DT_UNKNOWN
# define DT_UNKNOWN 0
#endif
#ifndef DT_DIR
# define DT_DIR 4
#endif
#ifndef DT_REG
# define DT_REG 8
#endif
#endif

#define DEFAULT_SLASH '/'
Expand Down
1 change: 1 addition & 0 deletions ext/phar/dirstream.c
Original file line number Diff line number Diff line change
Expand Up @@ -108,6 +108,7 @@ static ssize_t phar_dir_read(php_stream *stream, char *buf, size_t count) /* {{{
memset(buf, 0, sizeof(php_stream_dirent));
memcpy(((php_stream_dirent *) buf)->d_name, ZSTR_VAL(str_key), to_read);
((php_stream_dirent *) buf)->d_name[to_read + 1] = '\0';
((php_stream_dirent *) buf)->d_type = DT_UNKNOWN;

return sizeof(php_stream_dirent);
}
Expand Down
7 changes: 5 additions & 2 deletions ext/spl/spl_directory.c
Original file line number Diff line number Diff line change
Expand Up @@ -759,8 +759,6 @@ static void spl_filesystem_object_construct(INTERNAL_FUNCTION_PARAMETERS, zend_l

}
zend_restore_error_handling(&error_handling);

intern->u.dir.is_recursive = instanceof_function(intern->std.ce, spl_ce_RecursiveDirectoryIterator) ? 1 : 0;
}
/* }}} */

Expand Down Expand Up @@ -1487,6 +1485,11 @@ PHP_METHOD(RecursiveDirectoryIterator, hasChildren)
if (spl_filesystem_is_invalid_or_dot(intern->u.dir.entry.d_name)) {
RETURN_FALSE;
} else {
if (intern->u.dir.entry.d_type == DT_DIR) {
Copy link
Contributor

@mvorisek mvorisek Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the https://github.com/php/php-src/blob/php-8.2.7/ext/spl/spl_directory.c#L1247 code it seems functions like DirectoryIterator::isDir() currently involve stat syscall too.

There are used a lot in Symfony like in https://github.com/symfony/symfony/blob/v6.3.1/src/Symfony/Component/Finder/Iterator/ExcludeDirectoryFilterIterator.php#L62

Can DirectoryIterator::isDir(), DirectoryIterator::isFile() and maybe also isLink/getType be optimized the same way? Ie. if it is known the file entry is file or directory, return cached result early.

UPDATE:

DirectoryIterator::isDir() is cached using file status cache.

So this PR should be reworked to populate file status cache (if file/directory type is known) instead of storing d_type in iterator.

Then DirectoryIterator::isDir() will be no-syscall and the cache will be clearable via clearstatcache().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this PR should be reworked to populate file status cache (if file/directory type is known) instead of storing d_type in iterator.

The stat cache was already used prior to this PR. The problem is that there were stat calls in between for other files/folders, such that the cache entry couldn't be reused and the stat call had to happen again. So this comment does not make sense to me.

Copy link
Contributor

@mvorisek mvorisek Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean to pupulate the PHP stat cache here:

https://github.com/php/php-src/blob/05626ba83c/main/streams/plain_wrapper.c#L1035-L1039

based on the readdir result.

Then the stat in RecursiveDirectoryIterator::hasChildren should be resolved thru the PHP stat cache.

Currently I belive this is not done, as the file type result from readdir was never honored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The directory entries don't give the same information as stat does. As such, the stat cache cannot be filled in from there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@mvorisek mvorisek Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did benchmark on Windows 10 and NVMe SSD on php-src (21640 files, 449 directories).

/wo ->files():

  • 1.3 secs
  • 24 secs /w open_basedir! (due slow realpath)
  • 0.15 secs (/w this PR, I emulated the hasChildren optimization by a precomputed list of directories)

now /w ->files():

  • 1.3 secs
  • 24 secs
  • 1.3 secs (/w this PR)
  • 24 secs (/w this PR and open_basedir)

In "now /w ->files()" each iterated path is tested twice, once for is_dir for hasChildren, once for is_file(). The reason the times are the same is because the 2nd stat is cached. This is why I propose to transfer the knowledge from read_dir d_type to cache, even if the cache will be for last iterated entry will help these file discovery scenarios (time reduced from 1.3 (or 23 /w open_basedir) secs to 0.15 secs). Ideally cache the d_type of the last/current dir_read entry in some hashmap, so cache will be present for possibly concurrent iterators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought is that the benchmarking is at least a little flawed because you emulated the PR instead of applying it.
Also: 24 secs /w open_basedir! (due slow realpath) WOW, I guess we need to look into that ...

Ideally cache the d_type of the last/current dir_read entry in some hashmap, so cache will be present for possibly concurrent iterators.

But then we spend time managing a cache for a very specific use-case. I just experimented with transferring the knowledge of the d_type to SplFileInfo objects in spl_filesystem_object_create_type(). This is called for example if ->current() is called on a FileSystemIterator. Since Symfony constructs the file info manually instead from within the iterator, this gave a performance degradation of 5% because we paid the cost of the caching without getting any benefit. This demonstrates that even for lightweight caching, we must be careful with adding these kinds of optimizations: they can actually result in a slowdown.

This is why I propose (...)

Talk is cheap, show me the code ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought is that the benchmarking is at least a little flawed because you emulated the PR instead of applying it.

I am confident here, this PR avoids IO syscalls for hasChildren and that is what I did...

Also: 24 secs /w open_basedir! (due slow realpath) WOW, I guess we need to look into that ...

That is known issue with slow realpath on Windows - https://bugs.php.net/bug.php?id=53263 However, any improvement can imply big performance boost, as the realpath is really slow, there may be unneeded allocation, unoptimal syscalls etc..

Talk is cheap, show me the code ;)

This PR is perfect. When no is_dir/is_file is called during the iteration, for regular dirs/files, the travesing is much faster than before. But when is_dir/is_file is called (or OOP equivalents like SplFileInfo::isDir()), the performance is as before, because originally the hasChildren called is_dir internally. And it was cached, thus any additional is_dir was fast, as it was cached thru stat cache.

So I propose, to address the basic file discovery usecase, to transfer the knowledge from the readdir d_type to the single entry file stat cache as well. As the knowledge has not the full data as stat has, the cache will be valid for is_file and is_dir only.

This will then make code like, even without OOP iterator, very fast:

$p = '/path/to/files';
$handle = opendir($p);
$dirs = [];
while (false !== ($entry = readdir($handle))) {
    if (is_dir($p . '/' . $entry)) { // should be resolved thru stat cache from last readdir d_type result (when regular dir/file)
        $dirs[] = $entry;
    }
}
closedir($handle);

As partial stat file cache entry might be not expected by other C extensions, I think it is feasible to implement it as another single entry per thread cache. And in is_file/is_dir simply compare if the cache entry path is the same, if yes, use the cached entry. This cache should be cleared like the current stat cache is, ie. if file is moved/deleted or using clearstatcache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is ok in the current state. Further improvements can be done later of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just experimented with transferring the knowledge of the d_type to SplFileInfo objects in spl_filesystem_object_create_type(). This is called for example if ->current() is called on a FileSystemIterator. Since Symfony constructs the file info manually instead from within the iterator

@nielsdos I agree the transfering into SplFileInfo might be too edge usecase. But can the iterator fastpath optimization be used for isFile/isDir on the same iterator object as the hasChildren does? This will allow the fastpath knowledge to be accessible to userland. It should have close to zero performance penalty.

RETURN_TRUE;
} else if (intern->u.dir.entry.d_type == DT_REG) {
RETURN_FALSE;
}
if (spl_filesystem_object_get_file_name(intern) == FAILURE) {
RETURN_THROWS();
}
Expand Down
1 change: 0 additions & 1 deletion ext/spl/spl_directory.h
Original file line number Diff line number Diff line change
Expand Up @@ -64,7 +64,6 @@ struct _spl_filesystem_object {
php_stream *dirp;
zend_string *sub_path;
int index;
int is_recursive;
zend_function *func_rewind;
zend_function *func_next;
zend_function *func_valid;
Expand Down
1 change: 1 addition & 0 deletions ext/standard/ftp_fopen_wrapper.c
Original file line number Diff line number Diff line change
Expand Up @@ -629,6 +629,7 @@ static ssize_t php_ftp_dirstream_read(php_stream *stream, char *buf, size_t coun
memcpy(ent->d_name, ZSTR_VAL(basename), tmp_len);
ent->d_name[tmp_len - 1] = '\0';
zend_string_release_ex(basename, 0);
ent->d_type = DT_UNKNOWN;

/* Trim off trailing whitespace characters */
while (tmp_len > 0 &&
Expand Down
5 changes: 5 additions & 0 deletions main/php_streams.h
Original file line number Diff line number Diff line change
Expand Up @@ -107,7 +107,12 @@ typedef struct _php_stream_statbuf {
} php_stream_statbuf;

typedef struct _php_stream_dirent {
#ifdef NAME_MAX
char d_name[NAME_MAX + 1];
#else
char d_name[MAXPATHLEN];
#endif
unsigned char d_type;
} php_stream_dirent;

/* operations on streams that are file-handles */
Expand Down
1 change: 1 addition & 0 deletions main/streams/glob_wrapper.c
Original file line number Diff line number Diff line change
Expand Up @@ -150,6 +150,7 @@ static ssize_t php_glob_stream_read(php_stream *stream, char *buf, size_t count)
php_glob_stream_path_split(pglob, pglob->glob.gl_pathv[index], pglob->flags & GLOB_APPEND, &path);
++pglob->index;
PHP_STRLCPY(ent->d_name, path, sizeof(ent->d_name), strlen(path));
ent->d_type = DT_UNKNOWN;
return sizeof(php_stream_dirent);
}
pglob->index = glob_result_count;
Expand Down
5 changes: 5 additions & 0 deletions main/streams/plain_wrapper.c
Original file line number Diff line number Diff line change
Expand Up @@ -1032,6 +1032,11 @@ static ssize_t php_plain_files_dirstream_read(php_stream *stream, char *buf, siz
result = readdir(dir);
if (result) {
PHP_STRLCPY(ent->d_name, result->d_name, sizeof(ent->d_name), strlen(result->d_name));
#ifdef _DIRENT_HAVE_D_TYPE
ent->d_type = result->d_type;
#else
ent->d_type = DT_UNKNOWN;
#endif
return sizeof(php_stream_dirent);
}
return 0;
Expand Down
1 change: 1 addition & 0 deletions main/streams/userspace.c
Original file line number Diff line number Diff line change
Expand Up @@ -1320,6 +1320,7 @@ static ssize_t php_userstreamop_readdir(php_stream *stream, char *buf, size_t co
if (call_result == SUCCESS && Z_TYPE(retval) != IS_FALSE && Z_TYPE(retval) != IS_TRUE) {
convert_to_string(&retval);
PHP_STRLCPY(ent->d_name, Z_STRVAL(retval), sizeof(ent->d_name), Z_STRLEN(retval));
ent->d_type = DT_UNKNOWN;

didread = sizeof(php_stream_dirent);
} else if (call_result == FAILURE) {
Expand Down
7 changes: 7 additions & 0 deletions win32/readdir.c
Original file line number Diff line number Diff line change
Expand Up @@ -120,6 +120,13 @@ struct dirent *readdir(DIR *dp)

dp->dent.d_ino = 1;
dp->dent.d_off = dp->offset;
if (dp->fileinfo.dwFileAttributes & (FILE_ATTRIBUTE_REPARSE_POINT | FILE_ATTRIBUTE_DEVICE)) {
dp->dent.d_type = DT_UNKNOWN; /* conservative */
} else if (dp->fileinfo.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
dp->dent.d_type = DT_DIR;
} else {
dp->dent.d_type = DT_REG;
}

return &(dp->dent);
}/*}}}*/
Expand Down
6 changes: 6 additions & 0 deletions win32/readdir.h
Original file line number Diff line number Diff line change
Expand Up @@ -15,11 +15,17 @@ extern "C" {

#include "ioutil.h"

#define _DIRENT_HAVE_D_TYPE
#define DT_UNKNOWN 0
#define DT_DIR 4
#define DT_REG 8

/* struct dirent - same as Unix */
struct dirent {
long d_ino; /* inode (always 1 in WIN32) */
off_t d_off; /* offset to this dirent */
unsigned short d_reclen; /* length of d_name */
unsigned char d_type;
char d_name[1]; /* null terminated filename in the current encoding, glyph number <= 255 wchar_t's + \0 byte */
};

Expand Down