UTF-8 corruption in \Dom\HTMLDocument #17481

Closed
@xPaw

Description

The following code:

```php
<?php
$Repeated = str_repeat( '–', 4096 );
//$Repeated = str_repeat( '😏', 4096 );
$Data = '<!DOCTYPE HTML><html>' . $Repeated . '</html>';
$Document = \Dom\HTMLDocument::createFromString( $Data, 0, 'UTF-8' );

echo $Document->saveHTML();
// var_dump($Document->body->textContent);
```

The resulting string contains random invalid UTF-8 sequences like with the οΏ½ character. With the repeated emoji, emojis become corrupted. If you repeat the string for longer, there are more corrupted bytes in random places.

I initially spotted this bug when parsing a real HTML document and reading textContent (innerHTML produces the same issue) on an element I found with XPath.
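One way to quantify the corruption rather than eyeballing the output is sketched below. This is not part of the original report: `summarizeUtf8Damage` is a hypothetical helper name, and the check assumes the damage shows up either as invalid byte sequences or as U+FFFD replacement characters in `textContent`. The `\Dom\HTMLDocument` part only runs where that class exists (PHP 8.4+).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (an assumption, not from the report): compares the
 * text that went into the parser with the text that came back out.
 */
function summarizeUtf8Damage(string $expected, string $actual): array
{
    return [
        // preg_match with the /u modifier fails on invalid UTF-8 input.
        'valid_utf8'        => preg_match('//u', $actual) === 1,
        // Number of U+FFFD replacement characters in the result.
        'replacement_chars' => substr_count($actual, "\u{FFFD}"),
        // Byte-for-byte round trip check.
        'matches_input'     => $expected === $actual,
    ];
}

// Same input as the reproducer: 4096 en dashes (U+2013).
$Repeated = str_repeat("\u{2013}", 4096);

if (class_exists(\Dom\HTMLDocument::class)) {
    $Data = '<!DOCTYPE HTML><html>' . $Repeated . '</html>';
    $Document = \Dom\HTMLDocument::createFromString($Data, 0, 'UTF-8');
    var_dump(summarizeUtf8Damage($Repeated, $Document->body->textContent));
}
```

On an unaffected build, `matches_input` should be true and `replacement_chars` should be 0. Any other result suggests the text was altered between parsing and reading it back.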

PHP Version

PHP 8.4.2

Operating System

Windows 11 and Debian 12
