-
-
Notifications
You must be signed in to change notification settings - Fork 32.2k
doc: recursive glob ** follows symlinks to directories #12918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
doc: recursive glob ** follows symlinks to directories #12918
Conversation
Hello, and thanks for your contribution! I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA). Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue. You can check yourself to see if the CLA has been received. Thanks again for your contribution, we look forward to reviewing it! |
The initial purpose was to replace glob.glob(**, recursive=True) with pathlib.Path.glob(**) because the former follows symbolic links to directories in recursive mode (including "infinite" loops), whereas the latter (and more modern) doesn't. While trying to stick to this change, testing found a significant number of other issues and addressing all of them unfortunately turned into this significant rewrite. The entire Codeowners check is too small to isolate every change in separate commit, sorry for the resulting "bulk" review work load (~100 lines) Here's a simplified and extreme reproduction: - run a fairly extensive sanitycheck run. This creates 100,000s of files in sanity-out (thank you cmake) - observe how check_compliance.py -m Codeowners now takes minutes - create the following symbolic link: ln -s ../zephyr myself - observe how check_compliance.py -m Codeowners now takes hours or more. Only "hours" and not infinity because operating systems have limits to break infinite loops of symbolic links. Now considering Zephyr supports Windows and none of the multiple flavors of symbolic links offered by Windows is usable, it's unlikely Zephyr will ever _require_ symbolic links. However this shouldn't _forbid_ other users to use symbolic links privately. Funny enough neither glob nor pathlib documents whether they follow symbolic links to directories in recursive mode. "glob" does and "pathlib" doesn't. So I thought this was bug in glob and even had a tentative fix for it. Then I scanned the glob's git log and found it's actually a "feature": python/cpython#12918. glob's own documentation mentions pathlib as a higher-level and newer alternative: let's switch to pathlib. Other changes: - Fixed failing "path not found" detection of bogus lines in CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**', recursive=True) returns 'doesntexit/' instead of '' (easter egg: there is one bogus line right now) - Successfully tested on Windows. Requires hack replacing sh with pbs and disabling a couple unrelated imports though. Previous glob version failed to match, new files were never owned. - Remove (accidental?) limitation to run from the git top level. Moved git top-level command to a global and remove the same limitation on -m Checkpatch too. - Tighten the regular expression parsing CODEOWNERS lines to actually report invalid lines: make sure pattern starts in first character; make sure the owner is not mister whitespace. Performance: - Down from 1 sec to less than 0.5s in the most common case: not adding any new file. Fix broken optimization that still scanned the whole tree even when there are no new files (hint: [''] is not false) - About 20% faster on large sets without symbolic links (up to "infinitely" faster *with* symlinks) One possible reason: the list of owned files is now split per CODEOWNERS line for better logging which also means scanning smaller lists. Other minor fixes: - Remove unused git ls-files invocation, replaced with an optimization TODO. - Use more portable splitlines() instead of split('\n') - renamed parse_codeowners() to ls_owned_files() because but it does far more than parsing CODEOWNERS - Use "early returns" and extract new function to reduce cyclomatic complexity (up to 9 levels of indentation before in parse_codeowners(). Other minor simplifications. - Add some pydocs and logging Signed-off-by: Marc Herbert <marc.herbert@intel.com>
The initial purpose was to replace glob.glob(**, recursive=True) with pathlib.Path.glob(**) because the former follows symbolic links to directories in recursive mode (including "infinite" loops), whereas the latter (and more modern) doesn't. While trying to stick to this change, testing found a significant number of other issues and addressing all of them unfortunately turned into this significant rewrite. The entire Codeowners check is too small to isolate every change in separate commit, sorry for the resulting "bulk" review work load (~100 lines) Here's a simplified and extreme reproduction: - run a fairly extensive sanitycheck run. This creates 100,000s of files in sanity-out (thank you cmake) - observe how check_compliance.py -m Codeowners now takes minutes - create the following symbolic link: ln -s ../zephyr myself - observe how check_compliance.py -m Codeowners now takes hours or more. Only "hours" and not infinity because operating systems have limits to break infinite loops of symbolic links. Now considering Zephyr supports Windows and none of the multiple flavors of symbolic links offered by Windows is usable, it's unlikely Zephyr will ever _require_ symbolic links. However this shouldn't _forbid_ other users to use symbolic links privately. Funny enough neither glob nor pathlib documents whether they follow symbolic links to directories in recursive mode. "glob" does and "pathlib" doesn't. So I thought this was bug in glob and even had a tentative fix for it. Then I scanned the glob's git log and found it's actually a "feature": python/cpython#12918. glob's own documentation mentions pathlib as a higher-level and newer alternative: let's switch to pathlib. Other changes: - Fixed failing "path not found" detection of bogus lines in CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**', recursive=True) returns 'doesntexit/' instead of '' (easter egg: there is one bogus line right now) - Successfully tested on Windows. Requires hack replacing sh with pbs and disabling a couple unrelated imports though. Previous glob version failed to match, new files were never owned. - Remove (accidental?) limitation to run from the git top level. Moved git top-level command to a global and remove the same limitation on -m Checkpatch too. - Tighten the regular expression parsing CODEOWNERS lines to actually report invalid lines: make sure pattern starts in first character; make sure the owner is not mister whitespace. Performance: - Down from 1 sec to less than 0.5s in the most common case: not adding any new file. Fix broken optimization that still scanned the whole tree even when there are no new files (hint: [''] is not false) - About 20% faster on large sets without symbolic links (up to "infinitely" faster *with* symlinks) One possible reason: the list of owned files is now split per CODEOWNERS line for better logging which also means scanning smaller lists. Other minor fixes: - Remove unused git ls-files invocation, replaced with an optimization TODO. - Use more portable splitlines() instead of split('\n') - renamed parse_codeowners() to ls_owned_files() because but it does far more than parsing CODEOWNERS - Use "early returns" and extract new function to reduce cyclomatic complexity (up to 9 levels of indentation before in parse_codeowners(). Other minor simplifications. - Add some pydocs and logging Signed-off-by: Marc Herbert <marc.herbert@intel.com>
The initial purpose was to replace glob.glob(**, recursive=True) with pathlib.Path.glob(**) because the former follows symbolic links to directories in recursive mode (including "infinite" loops), whereas the latter (and more modern) doesn't. While trying to stick to this change, testing found a significant number of other issues and addressing all of them unfortunately turned into this significant rewrite. The entire Codeowners check is too small to isolate every change in separate commit, sorry for the resulting "bulk" review work load (~100 lines) Here's a simplified and extreme reproduction: - run a fairly extensive sanitycheck run. This creates 100,000s of files in sanity-out (thank you cmake) - observe how check_compliance.py -m Codeowners now takes minutes - create the following symbolic link: ln -s ../zephyr myself - observe how check_compliance.py -m Codeowners now takes hours or more. Only "hours" and not infinity because operating systems have limits to break infinite loops of symbolic links. Now considering Zephyr supports Windows and none of the multiple flavors of symbolic links offered by Windows is usable, it's unlikely Zephyr will ever _require_ symbolic links. However this shouldn't _forbid_ other users to use symbolic links privately. Funny enough neither glob nor pathlib documents whether they follow symbolic links to directories in recursive mode. "glob" does and "pathlib" doesn't. So I thought this was bug in glob and even had a tentative fix for it. Then I scanned the glob's git log and found it's actually a "feature": python/cpython#12918. glob's own documentation mentions pathlib as a higher-level and newer alternative: let's switch to pathlib. Other changes: - Fixed failing "path not found" detection of bogus lines in CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**', recursive=True) returns 'doesntexit/' instead of '' (easter egg: there is one bogus line right now) - Successfully tested on Windows. Requires hack replacing sh with pbs and disabling a couple unrelated imports though. Previous glob version failed to match, new files were never owned. - Remove (accidental?) limitation to run from the git top level. Moved git top-level command to a global and remove the same limitation on -m Checkpatch too. - Tighten the regular expression parsing CODEOWNERS lines to actually report invalid lines: make sure pattern starts in first character; make sure the owner is not mister whitespace. Performance: - Down from 1 sec to less than 0.5s in the most common case: not adding any new file. Fix broken optimization that still scanned the whole tree even when there are no new files (hint: [''] is not false) - About 20% faster on large sets without symbolic links (up to "infinitely" faster *with* symlinks) One possible reason: the list of owned files is now split per CODEOWNERS line for better logging which also means scanning smaller lists. Other minor fixes: - Remove unused git ls-files invocation, replaced with an optimization TODO. - Use more portable splitlines() instead of split('\n') - renamed parse_codeowners() to ls_owned_files() because but it does far more than parsing CODEOWNERS - Use "early returns" and extract new function to reduce cyclomatic complexity (up to 9 levels of indentation before in parse_codeowners(). Other minor simplifications. - Add some pydocs and logging Signed-off-by: Marc Herbert <marc.herbert@intel.com>
It is technically correct, but since my English is bad I'm not sure about the best phrasing. |
The initial purpose was to replace glob.glob(**, recursive=True) with pathlib.Path.glob(**) because the former follows symbolic links to directories in recursive mode (including "infinite" loops), whereas the latter (and more modern) doesn't. While trying to stick to this change, testing found a significant number of other issues and addressing all of them unfortunately turned into this significant rewrite. The entire Codeowners check is too small to isolate every change in separate commit, sorry for the resulting "bulk" review work load (~100 lines) Here's a simplified and extreme reproduction: - run a fairly extensive sanitycheck run. This creates 100,000s of files in sanity-out (thank you cmake) - observe how check_compliance.py -m Codeowners now takes minutes - create the following symbolic link: ln -s ../zephyr myself - observe how check_compliance.py -m Codeowners now takes hours or more. Only "hours" and not infinity because operating systems have limits to break infinite loops of symbolic links. Now considering Zephyr supports Windows and none of the multiple flavors of symbolic links offered by Windows is usable, it's unlikely Zephyr will ever _require_ symbolic links. However this shouldn't _forbid_ other users to use symbolic links privately. Funny enough neither glob nor pathlib documents whether they follow symbolic links to directories in recursive mode. "glob" does and "pathlib" doesn't. So I thought this was bug in glob and even had a tentative fix for it. Then I scanned the glob's git log and found it's actually a "feature": python/cpython#12918. glob's own documentation mentions pathlib as a higher-level and newer alternative: let's switch to pathlib. Other changes: - Fixed failing "path not found" detection of bogus lines in CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**', recursive=True) returns 'doesntexit/' instead of '' (easter egg: there is one bogus line right now) - Successfully tested on Windows. Requires hack replacing sh with pbs and disabling a couple unrelated imports though. Previous glob version failed to match, new files were never owned. - Remove (accidental?) limitation to run from the git top level. Moved git top-level command to a global and remove the same limitation on -m Checkpatch too. - Tighten the regular expression parsing CODEOWNERS lines to actually report invalid lines: make sure pattern starts in first character; make sure the owner is not mister whitespace. Performance: - Down from 1 sec to less than 0.5s in the most common case: not adding any new file. Fix broken optimization that still scanned the whole tree even when there are no new files (hint: [''] is not false) - About 20% faster on large sets without symbolic links (up to "infinitely" faster *with* symlinks) One possible reason: the list of owned files is now split per CODEOWNERS line for better logging which also means scanning smaller lists. Other minor fixes: - Remove unused git ls-files invocation, replaced with an optimization TODO. - Use more portable splitlines() instead of split('\n') - renamed parse_codeowners() to ls_owned_files() because but it does far more than parsing CODEOWNERS - Use "early returns" and extract new function to reduce cyclomatic complexity (up to 9 levels of indentation before in parse_codeowners(). Other minor simplifications. - Add some pydocs and logging Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@vstinner , @ncoghlan , @bitdancer, could you review this one-line doc change or help find someone who can? Just picking up semi-random names from the git log and from https://bugs.python.org/issue13968, sorry. |
My main question would be with the second half of the sentence: I think we follow symlinks regardless, but the revised wording makes it sound like appending (I suspect there's another error in that wording as well: when |
Good catch, thanks for the review. How about something like this then:
I will test os.altsep. UPDATE: a trailing os.altsep has the same effect as a trailing os.sep. Will mention both. |
@marc-h38 Aye, that revised wording looks good to me. |
The existing note does already warn that "Using the ** pattern in large directory trees may consume an inordinate amount of time", however following symbolic links to directories brings that to entirely new levels; learned the hard way. I initially assumed this was a bug, tentatively fixed with a simple: - if not dironly or entry.is_dir(): + if not dironly or entry.is_dir(follow_symlinks=False): ... in glob.py#_iterdir(dirname, dironly): Later I found https://bugs.python.org/issue13968 in the git log and saw this is working as intended with even a symlink loop test.
070276d
to
fd5babb
Compare
@ncoghlan , others: friendly ping? Should be a no-brainer now. |
Symlinks to files will not match as well. But symlinks to directories will match. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @marc-h38!
The proposed changes look good to me, but I would recommend some minor formatting adjustments:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Symlinks to files will not match either (when there's a trailing '/')
The sole purpose of this commit is to highlight that recursive glob ** treats symlinks to directories like directories. That's because symlinks to directories are relatively "evil": for instance they can easily cause infinite loops. Learned the hard way. For comparison, pathlib
does not recurse into symbolic links to directories.
On the other hand, there is nowhere near as much evil in symbolic link to files.
The current documentation does not make any reference to symbolic link to files, should I introduce this for the first time?
- ``os.sep`` or ``os.altsep`` + :data:`os.sep` or :data:`os.altsep` Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
@serhiy-storchaka do you think this should be expanded even more and should mention symbolic links to files too? (The risk being of course that no correction at all ever happens due to loss of interest from all parties ;-) |
Thanks @marc-h38 for the PR, and @JulienPalard for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8. |
Sorry @marc-h38 and @JulienPalard, I had trouble checking out the |
(cherry picked from commit e24594b) Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
GH-15995 is a backport of this pull request to the 3.7 branch. |
@marc-h38 thanks for enhancing the documentation, the current state of it looks good to me. Telling further about glob not "following symlink files" would probably just add too much and make the paragraph less easy to understand. Thanks! |
Thanks @marc-h38 for the PR, and @JulienPalard for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8. |
(cherry picked from commit e24594b) Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
GH-15996 is a backport of this pull request to the 3.8 branch. |
(cherry picked from commit e24594b) Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
(cherry picked from commit e24594b) Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
The initial purpose was to replace glob.glob(**, recursive=True) with pathlib.Path.glob(**) because the former follows symbolic links to directories in recursive mode (including "infinite" loops), whereas the latter (and more modern) doesn't. While trying to stick to this change, testing found a significant number of other issues and addressing all of them unfortunately turned into this significant rewrite. The entire Codeowners check is too small to isolate every change in separate commit, sorry for the resulting "bulk" review work load (~100 lines) Here's a simplified and extreme reproduction: - run a fairly extensive sanitycheck run. This creates 100,000s of files in sanity-out (thank you cmake) - observe how check_compliance.py -m Codeowners now takes minutes - create the following symbolic link: ln -s ../zephyr myself - observe how check_compliance.py -m Codeowners now takes hours or more. Only "hours" and not infinity because operating systems have limits to break infinite loops of symbolic links. Now considering Zephyr supports Windows and none of the multiple flavors of symbolic links offered by Windows is usable, it's unlikely Zephyr will ever _require_ symbolic links. However this shouldn't _forbid_ other users to use symbolic links privately. Funny enough neither glob nor pathlib documents whether they follow symbolic links to directories in recursive mode. "glob" does and "pathlib" doesn't. So I thought this was bug in glob and even had a tentative fix for it. Then I scanned the glob's git log and found it's actually a "feature": python/cpython#12918. glob's own documentation mentions pathlib as a higher-level and newer alternative: let's switch to pathlib. Other changes: - Fixed failing "path not found" detection of bogus lines in CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**', recursive=True) returns 'doesntexit/' instead of '' (easter egg: there is one bogus line right now) - Successfully tested on Windows. Requires hack replacing sh with pbs and disabling a couple unrelated imports though. Previous glob version failed to match, new files were never owned. - Remove (accidental?) limitation to run from the git top level. Moved git top-level command to a global and remove the same limitation on -m Checkpatch too. - Tighten the regular expression parsing CODEOWNERS lines to actually report invalid lines: make sure pattern starts in first character; make sure the owner is not mister whitespace. Performance: - Down from 1 sec to less than 0.5s in the most common case: not adding any new file. Fix broken optimization that still scanned the whole tree even when there are no new files (hint: [''] is not false) - About 20% faster on large sets without symbolic links (up to "infinitely" faster *with* symlinks) One possible reason: the list of owned files is now split per CODEOWNERS line for better logging which also means scanning smaller lists. Other minor fixes: - Remove unused git ls-files invocation, replaced with an optimization TODO. - Use more portable splitlines() instead of split('\n') - renamed parse_codeowners() to ls_owned_files() because but it does far more than parsing CODEOWNERS - Use "early returns" and extract new function to reduce cyclomatic complexity (up to 9 levels of indentation before in parse_codeowners(). Other minor simplifications. - Add some pydocs and logging Signed-off-by: Marc Herbert <marc.herbert@intel.com>
The existing note does already warn that "Using the ** pattern in large
directory trees may consume an inordinate amount of time", however
following symbolic links to directories brings that to entirely new
levels; learned the hard way.
I initially assumed this was a bug, tentatively fixed with a simple:
... in
glob.py#_iterdir(dirname, dironly)
:Later I found https://bugs.python.org/issue13968 in the git log and saw
this is working as intended with even a symlink loop test.