
file_get_contents() and file_put_contents() fail with data >=2GB on macOS & BSD #18753


Description


The buggy behavior

macOS (arm64)

Running the following code produces an error:

% php -dmemory_limit=-1 -r 'file_get_contents("big");'
PHP Notice:  file_get_contents(): Read of 4694832713 bytes failed with errno=22 Invalid argument in Command line code on line 1

Notice: file_get_contents(): Read of 4694832713 bytes failed with errno=22 Invalid argument in Command line code on line 1

The function on macOS returns a 0-byte string, as verified with gettype(file_get_contents(...)) and strlen(file_get_contents(...)). The file is almost 5GB in size:

% php -r 'echo filesize("big") . "\n";'
4694824521

Note the size reported: it appears that macOS tries to read exactly 8,192 bytes past the file size. This is probably not related, though; see below.

Comparing to Linux (x86_64)

On a fully updated Debian 13.0 the result is different:

$ php -dmemory_limit=-1 -r 'echo gettype(file_get_contents("big")) . "\n";'
string

$  php -dmemory_limit=-1 -r 'echo strlen(file_get_contents("big")) . "\n";'
4694824521

PHP Versions

macOS (PHP installed via Homebrew):

% php -v
PHP 8.4.7 (cli) (built: May  6 2025 12:31:58) (NTS)
Copyright (c) The PHP Group
Built by Shivam Mathur
Zend Engine v4.4.7, Copyright (c) Zend Technologies
    with Zend OPcache v8.4.7, Copyright (c), by Zend Technologies

Linux:

$ php -v
PHP 8.4.7 (cli) (built: May  9 2025 07:02:39) (NTS)
Copyright (c) The PHP Group
Zend Engine v4.4.7, Copyright (c) Zend Technologies
    with Zend OPcache v8.4.7, Copyright (c), by Zend Technologies

Operating System

% sw_vers
ProductName:		macOS
ProductVersion:		15.5
BuildVersion:		24F74

% uname -m
arm64

Looking for the culprit

How does it fail?

While I am not a C developer, nor do I have great familiarity with the ZE codebase, I tried to take a crack at this. The error seems to come from php_stdiop_read():

if (!(stream->flags & PHP_STREAM_FLAG_SUPPRESS_ERRORS)) {
    php_error_docref(NULL, E_NOTICE, "Read of %zu bytes failed with errno=%d %s", count, errno, strerror(errno));
}

Initially, I thought it was about the 4GB size, or about the error reporting a size off by 8K from the real file size, but neither turned out to be the case. In fact, any read of 2GB or more will fail:

php > echo strlen(file_get_contents('big', length: 2 * 1024 * 1024 * 1024 - 1));
2147483647

php > echo strlen(file_get_contents('big', length: 2 * 1024 * 1024 * 1024));
PHP Notice:  file_get_contents(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1

Notice: file_get_contents(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1
0

file_get_contents() fails only for regular files, regardless of the underlying filesystem (tested on regular APFS & an HFS+ ramdisk):

php > echo strlen(file_get_contents('/dev/zero', length: 5 * 1024 * 1024 * 1024));
5368709120

The issue seems to be isolated to file_get_contents(). My initial hunch about reads in chunks larger than SSIZE_MAX also led nowhere, as a single fread() is able to read the file just fine:

php > echo strlen(fread(fopen('big', 'r'), filesize('big')));
4694824521

php > var_dump(stream_copy_to_stream(fopen('big','r'), fopen('dst','w')));
int(4694824521)

The issue is also not related to an old bug of mine, 69824, about variables >2GB: on modern PHP versions, creating a 5GB string (i.e. larger than the file) isn't a problem.
I also couldn't replicate it using PHP code that doesn't use file_get_contents().

Why does it fail?

If I'm reading the file_get_contents() implementation for files correctly, it calls _php_stream_copy_to_mem(), which in turn calls the universal _php_stream_read(), which calls stream->ops->read() on the stream. For plain files, I believe that op is set to php_stdiop_read().

I suspected that read(2) was being called with the full $length as passed to file_get_contents(), which points to the behavior of read(2) differing between Darwin and Linux.
I wrote a quick C reproducer and tested:
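
The gist of the reproducer is a single read(2) call with the full requested size. A minimal sketch of equivalent code (not verbatim what I ran; the output format differs slightly):

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void try_read(const char *path, size_t count) {
    printf("=================================\n");
    printf("Trying to get %zu from %s\n", count, path);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return; }
    printf("File \"%s\" opened, allocating memory...\n", path);

    char *buf = malloc(count);
    if (!buf) { perror("malloc"); close(fd); return; }
    printf("Memory allocated, attempting read...\n");

    /* One read(2) call with the full requested size: Linux silently
     * truncates the request, while Darwin/BSD fail with EINVAL for
     * counts larger than INT_MAX. */
    ssize_t got = read(fd, buf, count);
    if (got < 0) {
        printf("!! read() failed - errno=%d err=%s\n", errno, strerror(errno));
    } else {
        printf("Did read %zd bytes ($req-$actual=%zd)\n", got, (ssize_t)count - got);
    }

    free(buf);
    close(fd);
}

int main(void) {
    printf("Platform SSIZE_MAX=%zd\n", (ssize_t)SSIZE_MAX);
    printf("Platform INT_MAX=%d\n", INT_MAX);
    try_read("big", 2147483648u);  /* 2 GiB     -> EINVAL on Darwin/BSD */
    try_read("big", 2147483647u);  /* 2 GiB - 1 -> works everywhere    */
    return 0;
}

Running it against the same ~5GB file on both platforms: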

### macOS
Platform SSIZE_MAX=9223372036854775807
Platform INT_MAX=2147483647
=================================
Trying to get 2147483648 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
!! read() failed - errno=22 err=Invalid argument
=================================
Trying to get 2147483647 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147483647 bytes ($req-$actual=0)


### Linux
Platform SSIZE_MAX=9223372036854775807
Platform INT_MAX=2147483647
=================================
Trying to get 2147483648 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147479552 bytes ($req-$actual=4096)
=================================
Trying to get 2147483647 from big
File "big" opened, allocating memory...
Memory allocated, attempting read...
Did read 2147479552 bytes ($req-$actual=4095)

Linux accepts an arbitrary size for read(2) and simply returns the maximum amount it can in one call (the curious 2GB-4K; explained below), which lets the stream logic handle stitching. Darwin/XNU and BSD kernels instead immediately return EINVAL if the requested chunk size is larger than INT_MAX.
The same problem also affects file_put_contents(), for the same reasons.
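
If I'm reading the kernel source right, the 2GB-4K figure comes from the Linux kernel's internal clamp on every read/write request, defined in include/linux/fs.h:

/* INT_MAX rounded down to a page boundary: 0x7FFFF000 = 2147479552 bytes,
   i.e. 2GiB - 4KiB with 4KiB pages - exactly what read() returned above. */
#define MAX_RW_COUNT (INT_MAX & PAGE_MASK)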

Possible fix?

This behavior appears to be known, as stream_set_chunk_size() errors out if the requested chunk size is > INT_MAX, on all platforms. Moreover, while debugging I came full circle: php_stdiop_read() does clamp the max chunk/buffer size to INT_MAX, but only on Windows.
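
For reference, that clamp is a macro in main/streams/plain_wrapper.c; paraphrased here (the exact casts in the source may differ):

#ifdef PHP_WIN32
#define PLAIN_WRAP_BUF_SIZE(sz) (((sz) > INT_MAX) ? INT_MAX : (unsigned int)(sz))
#else
#define PLAIN_WRAP_BUF_SIZE(sz) (sz)   /* passed straight through to read(2) */
#endif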
I think adding the same clamping for macOS and BSD, in addition to Windows, is the simplest solution; PR provided.
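
A sketch of what I mean (the platform guards here are an assumption on my part; the actual PR may pick a different condition, e.g. a configure-time check):

/* Sketch only: extend the Windows-only clamp to every platform whose
   read(2) rejects counts above INT_MAX (Darwin and the BSDs). */
#if defined(PHP_WIN32) || defined(__APPLE__) || defined(__FreeBSD__) || \
    defined(__NetBSD__) || defined(__OpenBSD__) || defined(__DragonFly__)
#define PLAIN_WRAP_BUF_SIZE(sz) (((sz) > INT_MAX) ? INT_MAX : (sz))
#else
#define PLAIN_WRAP_BUF_SIZE(sz) (sz)
#endif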

Affected versions

The issue will only appear if the stream read buffer is set > INT_MAX, which in the case of file_get_contents() bisects to commit 6beee1a from #8547, which first landed in PHP 8.2.

Knowing this, I found that this isn't a problem with just file_get_contents() but also with fread(), as stream_set_read_buffer() doesn't guard against this:

php > $f = fopen("big", "r"); stream_set_read_buffer($f, 0); fread($f, 2147483648);
PHP Notice:  fread(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1

Notice: fread(): Read of 2147483648 bytes failed with errno=22 Invalid argument in php shell code on line 1

However, I don't think this needs to be guarded, even for DX, as this is a user shooting themselves in the foot. After the patch, the code above will instead fail with Notice: fread(): Read of 2147483648 bytes failed with errno=9 Bad file descriptor.

Dataset

The exact file I encountered the problem with is available from Cornell University. You can get it directly via curl -L -o ~/Downloads/arxiv.zip https://www.kaggle.com/api/v1/datasets/download/Cornell-University/arxiv. However, after some digging I found it's not about this exact file; e.g. a sparse file created with truncate -s 4694824521 big reproduces it too.
