Closed
Description
The following code:
<?php
$Repeated = str_repeat( 'β', 4096 );
//$Repeated = str_repeat( 'π', 4096 );
$Data = '<!DOCTYPE HTML><html>' . $Repeated . '</html>';
$Document = \Dom\HTMLDocument::createFromString( $Data, 0, 'UTF-8' );
echo $Document->saveHTML();
// var_dump($Document->body->textContent);
The resulting string contains random invalid UTF-8 sequences like with the οΏ½ character. With the repeated emoji, emojis become corrupted. If you repeat the string for longer, there are more corrupted bytes in random places.
I initially spotted this bug when parsing a real HTML document and reading textContent (innerHTML produces the same issue) on an element I found with XPath.
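To make the corruption easier to measure than by eyeballing the output, something like the following could be used. It is only a sketch: `summarizeUtf8Damage` is a hypothetical helper name, not part of the original report, and it assumes the damage shows up either as malformed byte sequences or as literal U+FFFD characters in the serialized string.

```php
<?php
// Hypothetical helper (not from the original report): summarizes UTF-8 damage in a string.
function summarizeUtf8Damage(string $text): array
{
    return [
        // preg_match() with the /u modifier returns false when the subject is malformed UTF-8.
        'valid' => preg_match('//u', $text) === 1,
        // Count literal U+FFFD replacement characters (bytes EF BF BD).
        'replacementChars' => substr_count($text, "\u{FFFD}"),
    ];
}

// Usage against the reproduction above; \Dom\HTMLDocument requires PHP 8.4 or later.
if (class_exists(\Dom\HTMLDocument::class)) {
    $data = '<!DOCTYPE HTML><html>' . str_repeat('β', 4096) . '</html>';
    $document = \Dom\HTMLDocument::createFromString($data, 0, 'UTF-8');
    var_dump(summarizeUtf8Damage($document->saveHTML()));
}
```

On an unaffected build, the expectation would be `valid` being `true` and `replacementChars` being `0`, since the input contains neither malformed bytes nor U+FFFD.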
PHP Version
PHP 8.4.2
Operating System
Windows 11 and Debian 12