Skip to content

Fix #70962: xml_parse_into_struct strips embedded whitespace with XML… #7493

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

Flashwade1990
Copy link

…_OPTION_SKIP_WHITE.

@cmb69 cmb69 added the Bug label Sep 15, 2021
@cmb69 cmb69 self-assigned this Sep 15, 2021
Copy link
Member

@cmb69 cmb69 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this! The patch looks generally good to me. I left some comments on details to improve.

One thing that makes me wonder is the description of XML_OPTION_SKIP_WHITE in the manual:

Whether to skip values consisting of whitespace characters.

That makes me think that not only the value element of the respective $values would be skipped, but the complete $values element (i.e. for the given test script count($values) === 3). However, changing this would be a BC break, and given the long-standing behavior, it might be better to clarify the documentation.

--EXPECT--
<d>
<e>
bool(false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing line break

ext/xml/xml.c Outdated
} else {
zend_string_release_ex(decoded_value, 0);
}
if (parser->lastwasopen) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use tabs (as four spaces) for indentation.

ext/xml/xml.c Outdated

/* check if the current tag already has a value - if yes append to that! */
if ((myval = zend_hash_str_find(Z_ARRVAL_P(parser->ctag), "value", sizeof("value") - 1))) {
int newlen = Z_STRLEN_P(myval) + ZSTR_LEN(decoded_value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
int newlen = Z_STRLEN_P(myval) + ZSTR_LEN(decoded_value);
size_t newlen = Z_STRLEN_P(myval) + ZSTR_LEN(decoded_value);

That was int before this patch, but I think we should fix that right away.

@cmb69
Copy link
Member

cmb69 commented Sep 15, 2021

I just noticed that empty "cdata" nodes have an empty "value" even if XML_OPTION_SKIP_WHITE is set; this patch doesn't change that. Not sure if it should, though. Probably not, since this is actually a different issue, but it makes it harder to document the details. :(

PS: it is actually different, see https://3v4l.org/ue9dU and https://3v4l.org/GTZtn. So, if XML_OPTION_SKIP_WHITE is set, and the cdata node is empty or contains whitespace only, the whole element is skipped. This should likely be the same for "complete" elements, but for BC reasons I wouldn't want to change that in a stable version. Sigh.

@Flashwade1990
Copy link
Author

I just noticed that empty "cdata" nodes have an empty "value" even if XML_OPTION_SKIP_WHITE is set; this patch doesn't change that. Not sure if it should, though. Probably not, since this is actually a different issue, but it makes it harder to document the details. :(

Unfortunately I was not able to find any information regarding XML_OPTION_SKIP_WHITE behavior as well. I may try to work on skipping the whole value object if it makes sense, but it will introduce BC break as you have already mentioned.

@cmb69
Copy link
Member

cmb69 commented Sep 15, 2021

Unfortunately I was not able to find any information regarding XML_OPTION_SKIP_WHITE behavior as well. I may try to work on skipping the whole value object if it makes sense, but it will introduce BC break as you have already mentioned.

I suggest to leave this as is for now, but I noticed a change due to this PR regarding cdata element. Please see https://3v4l.org/L4vTC; this has no "cdata" element, but with this PR there is one now. I think that needs to fixed. This extension's test suite is definitely not comprehensive enough. :(

@Flashwade1990
Copy link
Author

Unfortunately I was not able to find any information regarding XML_OPTION_SKIP_WHITE behavior as well. I may try to work on skipping the whole value object if it makes sense, but it will introduce BC break as you have already mentioned.

I suggest to leave this as is for now, but I noticed a change due to this PR regarding cdata element. Please see https://3v4l.org/L4vTC; this has no "cdata" element, but with this PR there is one now. I think that needs to fixed. This extension's test suite is definitely not comprehensive enough. :(

Ok I will leave this PR as it is. Does it make sense to work on a separate PR for skipping an empty "cdata" element as it was before?

@cmb69
Copy link
Member

cmb69 commented Sep 15, 2021

Ok I will leave this PR as it is. Does it make sense to work on a separate PR for skipping an empty "cdata" element as it was before?

I think this needs to be fixed in this PR; otherwise we would introduce a BC break.

By the way, do you test locally with libxml2 or libexpat? I don't think that we have CI testing with libexpat, but the behavior may be different.

@Flashwade1990
Copy link
Author

Ok I will leave this PR as it is. Does it make sense to work on a separate PR for skipping an empty "cdata" element as it was before?

I think this needs to be fixed in this PR; otherwise we would introduce a BC break.

By the way, do you test locally with libxml2 or libexpat? I don't think that we have CI testing with libexpat, but the behavior may be different.

Ok, will try to fix this issue in the current PR.

Based on compile info it is libxml2. Sorry, I am really new in C and do not know how to change library and test using libexpat. I need to dig deeper into code to find a way.

@cmb69
Copy link
Member

cmb69 commented Sep 15, 2021

No need for you to check with libexpat; I can do that. :)

@Flashwade1990
Copy link
Author

No need for you to check with libexpat; I can do that. :)

Thank you :)

@cmb69
Copy link
Member

cmb69 commented Sep 15, 2021

Something like the following might be sufficient to solve the cdata issue:

 ext/xml/xml.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/ext/xml/xml.c b/ext/xml/xml.c
index 45fd06971d..273e523d94 100644
--- a/ext/xml/xml.c
+++ b/ext/xml/xml.c
@@ -948,7 +948,9 @@ void _xml_characterDataHandler(void *userData, const XML_Char *s, int len)
 
                     add_assoc_string(&tag, "tag", SKIP_TAGSTART(parser->ltags[parser->level-1]));
                     add_assoc_str(&tag, "value", decoded_value);
-                    add_assoc_string(&tag, "type", "cdata");
+					if (!parser->skipwhite) {
+                    	add_assoc_string(&tag, "type", "cdata");
+					}
                     add_assoc_long(&tag, "level", parser->level);
 
                     zend_hash_next_index_insert(Z_ARRVAL(parser->data), &tag);

This diff won't cleanly apply, because it's based on an earlier version of this PR, but you can apply manually. :)

PS: Nope, disregard this. Sorry for the noise.

@cmb69
Copy link
Member

cmb69 commented Sep 15, 2021

Maybe we should go for a simpler fix?

 ext/xml/xml.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/ext/xml/xml.c b/ext/xml/xml.c
index fd8aebe03a..df052b8910 100644
--- a/ext/xml/xml.c
+++ b/ext/xml/xml.c
@@ -900,7 +900,6 @@ void _xml_characterDataHandler(void *userData, const XML_Char *s, int len)
 					break;
 				}
 			}
-			if (doprint || (! parser->skipwhite)) {
 				if (parser->lastwasopen) {
 					zval *myval;
 
@@ -915,7 +914,7 @@ void _xml_characterDataHandler(void *userData, const XML_Char *s, int len)
 						add_assoc_str(parser->ctag, "value", decoded_value);
 					}
 
-				} else {
+				} else if (doprint || (! parser->skipwhite)) {
 					zval tag;
 					zval *curtag, *mytype, *myval;
 
@@ -949,7 +948,6 @@ void _xml_characterDataHandler(void *userData, const XML_Char *s, int len)
 					} else if (parser->level == (XML_MAXLEVEL + 1)) {
 											php_error_docref(NULL, E_WARNING, "Maximum depth exceeded - Results truncated");
 					}
-				}
 			} else {
 				zend_string_release_ex(decoded_value, 0);
 			}

And then fix the indentation, and intsize_t. I have not tested carefully, but it seems to work, and not to introduce any BC breaks.

@Flashwade1990
Copy link
Author

Maybe we should go for a simpler fix?

 ext/xml/xml.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/ext/xml/xml.c b/ext/xml/xml.c
index fd8aebe03a..df052b8910 100644
--- a/ext/xml/xml.c
+++ b/ext/xml/xml.c
@@ -900,7 +900,6 @@ void _xml_characterDataHandler(void *userData, const XML_Char *s, int len)
 					break;
 				}
 			}
-			if (doprint || (! parser->skipwhite)) {
 				if (parser->lastwasopen) {
 					zval *myval;
 
@@ -915,7 +914,7 @@ void _xml_characterDataHandler(void *userData, const XML_Char *s, int len)
 						add_assoc_str(parser->ctag, "value", decoded_value);
 					}
 
-				} else {
+				} else if (doprint || (! parser->skipwhite)) {
 					zval tag;
 					zval *curtag, *mytype, *myval;
 
@@ -949,7 +948,6 @@ void _xml_characterDataHandler(void *userData, const XML_Char *s, int len)
 					} else if (parser->level == (XML_MAXLEVEL + 1)) {
 											php_error_docref(NULL, E_WARNING, "Maximum depth exceeded - Results truncated");
 					}
-				}
 			} else {
 				zend_string_release_ex(decoded_value, 0);
 			}

And then fix the indentation, and intsize_t. I have not tested carefully, but it seems to work, and not to introduce any BC breaks.

Tag <c>\n \t</c> will not be skipped in this case and will be present in resulting array with value:
array(4) { ["tag"]=> string(1) "C" ["type"]=> string(8) "complete" ["level"]=> int(2) ["value"]=> string(3) " "

@cmb69
Copy link
Member

cmb69 commented Sep 15, 2021

Oh, right! Just when I thought I understand what the function is currently doing, I have to learn more about it. :)

@Flashwade1990
Copy link
Author

Oh, right! Just when I thought I understand what the function is currently doing, I have to learn more about it. :)

Latest commit returns an original behavior for skipping empty "cdata" elements. I think all cases are covered now and there is no BC break anymore :)

Copy link
Member

@cmb69 cmb69 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! This is good now, and also works fine with libexpat. I'm going to commit.

@cmb69 cmb69 closed this in a9661a5 Sep 16, 2021
@radu-barbu-sage
Copy link

radu-barbu-sage commented Jul 8, 2022

@Flashwade1990 - This introduces a bug, a CRLF and multiple spaces gets added to the structure.

using the following code snipet

<title>Dynamic Allocations</title> Configuring_Account_Allocation

$p = xml_parser_create('UTF-8');
xml_parser_set_option($p, XML_OPTION_CASE_FOLDING, 0);
xml_parser_set_option($p, XML_OPTION_SKIP_WHITE, true);
xml_parse_into_struct($p, $xml, $vals, $index);
$errcode = xml_get_error_code($p);

echo $errcode;
xml_parser_free($p);
var_dump($vals[7]);

result on onlinephp.io:
image

having the next tag on the same line as the ending CDATA tag is a workaround:

image

@cmb69
Copy link
Member

cmb69 commented Jul 8, 2022

@radu-barbu-sage, this is deliberate. XML_OPTION_SKIP_WHITE is documented to "Whether to skip values consisting of whitespace characters." If you want to remove surrounding whitespace, use trim().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants