### Description

### The buggy behavior

#### macOS (arm64)
Running the following code produces an error:
```
% php -dmemory_limit=-1 -r 'file_get_contents("big");'
PHP Notice: file_get_contents(): Read of 4694832713 bytes failed with errno=22 Invalid argument in Command line code on line 1

Notice: file_get_contents(): Read of 4694832713 bytes failed with errno=22 Invalid argument in Command line code on line 1
```
The function on macOS returns a 0-byte string, as verified with `gettype(file_get_contents(...))` and `strlen(file_get_contents(...))`. The file is almost 5 GB in size:
```
% php -r 'echo filesize("big") . "\n";'
4694824521
```
Note the size reported in the error: the requested read is exactly 8,192 bytes (PHP's default stream chunk size) past the file size. This is probably not related, see below.
#### Comparing to Linux (x86_64)
On a fully updated Debian 13.0 the result is different:
```
$ php -dmemory_limit=-1 -r 'echo gettype(file_get_contents("big")) . "\n";'
string
$ php -dmemory_limit=-1 -r 'echo strlen(file_get_contents("big")) . "\n";'
4694824521
```
### PHP Versions

macOS (PHP installed via Homebrew):
```
% php -v
PHP 8.4.7 (cli) (built: May 6 2025 12:31:58) (NTS)
Copyright (c) The PHP Group
Built by Shivam Mathur
Zend Engine v4.4.7, Copyright (c) Zend Technologies
    with Zend OPcache v8.4.7, Copyright (c), by Zend Technologies
```
Linux:
```
$ php -v
PHP 8.4.7 (cli) (built: May 9 2025 07:02:39) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.7, Copyright (c) Zend Technologies
    with Zend OPcache v8.4.7, Copyright (c), by Zend Technologies
```
### Operating System
```
% sw_vers
ProductName: macOS
ProductVersion: 15.5
BuildVersion: 24F74
% uname -m
arm64
```
### Looking for the culprit

#### How it fails
While I am not a C developer, nor do I have great familiarity with the Zend Engine codebase, I tried to take a crack at this. The error seems to be coming from `php_stdiop_read()` (`main/streams/plain_wrapper.c`, lines 446 to 448 at `359bb63`).
Initially I thought it was about the 4 GB size, or about the error reporting a size 8 KB off from the real file size, but that doesn't seem to be the case. In fact, any read of 2 GB or more fails:
```
php > echo strlen(file_get_contents('big', length: 2 * 1024 * 1024 * 1024 - 1));
2147483647
php > echo strlen(file_get_contents('big', length: 2 * 1024 * 1024 * 1024));
PHP Notice: file_get_contents(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1

Notice: file_get_contents(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1
0
```
`file_get_contents()` fails only for regular files, regardless of the underlying filesystem (tested on regular APFS and an HFS+ ramdisk):
```
php > echo strlen(file_get_contents('/dev/zero', length: 5 * 1024 * 1024 * 1024));
5368709120
```
The issue seems to be isolated to `file_get_contents()` only. My initial hunch about reads in chunks larger than `SSIZE_MAX` also led nowhere, as a single `fread()` is able to read the file as well:
```
php > echo strlen(fread(fopen('big', 'r'), filesize('big')));
4694824521
php > var_dump(stream_copy_to_stream(fopen('big','r'), fopen('dst','w')));
int(4694824521)
```
The issue is also not related to an old bug 69824 of mine with variables larger than 2 GB, as on modern PHP versions creating a 5 GB string (i.e. larger than the file) isn't a problem. I also couldn't replicate it using PHP code that doesn't use `file_get_contents()`.
#### Why it fails
If I'm reading the `file_get_contents()` implementation for files correctly, it calls `_php_stream_copy_to_mem()`, which then calls the universal `_php_stream_read()`, which in turn calls `stream->ops->read()` on the stream. I think that call is routed to `php_stdiop_read()`.

I suspected that `read(2)` is being called with the full `$length` as passed to `file_get_contents()`. This points to the behavior of `read(2)` differing between Darwin and Linux.
I wrote a quick C reproducer and tested:
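The reproducer boiled down to something like this (a minimal sketch reconstructed to match the output below; the `try_read()` helper and the exact message strings are illustrative):

```c
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Attempt a single read(2) of `req` bytes from `path` and report what happens. */
static void try_read(const char *path, size_t req)
{
    printf("=================================\n");
    printf("Trying to get %zu from %s\n", req, path);

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        printf("!! open() failed - errno=%d err=%s\n", errno, strerror(errno));
        return;
    }
    printf("File \"%s\" opened, allocating memory...\n", path);

    char *buf = malloc(req);
    if (buf == NULL) {
        printf("!! malloc() failed\n");
        close(fd);
        return;
    }
    printf("Memory allocated, attempting read...\n");

    /* The interesting part: one giant read(), similar to what
     * php_stdiop_read() issues when handed a > 2 GB buffer size. */
    ssize_t got = read(fd, buf, req);
    if (got < 0) {
        printf("!! read() failed - errno=%d err=%s\n", errno, strerror(errno));
    } else {
        printf("Did read %zd bytes ($req-$actual=%lld)\n", got, (long long)req - got);
    }

    free(buf);
    close(fd);
}

int main(void)
{
    printf("Platform SSIZE_MAX=%lld\n", (long long)SSIZE_MAX);
    printf("Platform INT_MAX=%d\n", INT_MAX);

    try_read("big", 2147483648ULL); /* 2 GiB: EINVAL on Darwin/BSD */
    try_read("big", 2147483647ULL); /* 2 GiB - 1: works everywhere */
    return 0;
}
```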
### macOS
```
Platform SSIZE_MAX=9223372036854775807
Platform INT_MAX=2147483647
=================================
Trying to get 2147483648 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
!! read() failed - errno=22 err=Invalid argument
=================================
Trying to get 2147483647 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147483647 bytes ($req-$actual=0)
```
### Linux
```
Platform SSIZE_MAX=9223372036854775807
Platform INT_MAX=2147483647
=================================
Trying to get 2147483648 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147479552 bytes ($req-$actual=4096)
=================================
Trying to get 2147483647 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147479552 bytes ($req-$actual=4095)
```
Linux accepts an arbitrary size for `read(2)` and simply returns the maximum amount it can handle, which lets the stream logic handle the stitching. (The mysterious 2 GB − 4 KB is Linux's documented per-call cap: a single `read()` transfers at most 0x7ffff000, i.e. 2,147,479,552 bytes.) The Darwin/XNU and BSD kernels instead immediately return `EINVAL` if the requested chunk size is larger than `INT_MAX`.

The same problem also affects `file_put_contents()` for the same reasons, as `write(2)` has the same `INT_MAX` limit on those kernels.
#### Possible fix?
This behavior appears to be known, as `stream_set_chunk_size()` errors out if the requested chunk size is > `INT_MAX` on all platforms. Moreover, while debugging I came full circle: `php_stdiop_read()` does clamp the max chunk/buffer size to `INT_MAX`, but only on Windows. I think adding the clamping for macOS and BSD, in addition to Windows, is the simplest solution; PR provided.
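To illustrate the idea, here is the shape of the clamp, modeled on the existing Windows-only `PLAIN_WRAP_BUF_SIZE` macro in `main/streams/plain_wrapper.c` (a sketch, not the exact PR diff; the platform guard list is illustrative):

```c
/* main/streams/plain_wrapper.c (sketch, not the exact PR diff).
 * Windows already clamps every plain-stream read/write to INT_MAX;
 * extending the same clamp to Darwin/BSD avoids their EINVAL for
 * read()/write() requests larger than INT_MAX. */
#if defined(PHP_WIN32) || defined(__APPLE__) || defined(__FreeBSD__) /* illustrative guard list */
# define PLAIN_WRAP_BUF_SIZE(c) (((c) > INT_MAX) ? INT_MAX : (unsigned int)(c))
#else
# define PLAIN_WRAP_BUF_SIZE(c) (c)
#endif
```

With the clamp in place, `php_stdiop_read()`'s `read(data->fd, buf, PLAIN_WRAP_BUF_SIZE(count))` never requests more than `INT_MAX` bytes in a single call, and the stream layer's existing looping logic stitches larger reads together, much like it already happens on Linux via the kernel's own cap.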
### Affected versions

The issue will only appear if the stream read buffer is set > `INT_MAX`, which in the case of `file_get_contents()` bisects to commit 6beee1a from #8547, which first landed in PHP 8.2. Knowing this, I found that this isn't a problem with just `file_get_contents()` but also with `fread()`, as `stream_set_read_buffer()` doesn't guard against this:
```
php > $f = fopen("big", "r"); stream_set_read_buffer($f, 0); fread($f, 2147483648);
PHP Notice: fread(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1

Notice: fread(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1
```
However, I don't think this needs to be guarded even for DX, as this is a user shooting themselves in the foot. After the patch, the code above will instead fail with `Notice: fread(): Read of 2147483648 bytes failed with errno=9 Bad file descriptor`.
### Dataset

The exact file I encountered the problem with is available from Cornell University. You can get it directly via `curl -L -o ~/Downloads/arxiv.zip https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv`. However, after some digging I see it's not about this exact file; a sparse file created with `truncate -s 4694824521 big` triggers the bug too.