Skip to content

doc: recursive glob ** follows symlinks to directories #12918

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 11, 2019

Conversation

marc-h38
Copy link
Contributor

@marc-h38 marc-h38 commented Apr 23, 2019

The existing note does already warn that "Using the ** pattern in large
directory trees may consume an inordinate amount of time", however
following symbolic links to directories brings that to entirely new
levels; learned the hard way.

I initially assumed this was a bug, tentatively fixed with a simple:

-   if not dironly or entry.is_dir():
+   if not dironly or entry.is_dir(follow_symlinks=False):

... in glob.py#_iterdir(dirname, dironly):

Later I found https://bugs.python.org/issue13968 in the git log and saw
this is working as intended with even a symlink loop test.

@bedevere-bot bedevere-bot added the docs Documentation in the Doc dir label Apr 23, 2019
@the-knights-who-say-ni
Copy link

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@marc-h38
Copy link
Contributor Author

cc @serhiy-storchaka

marc-hb added a commit to marc-hb/ci-tools that referenced this pull request Apr 26, 2019
The initial purpose was to replace glob.glob(**, recursive=True) with
pathlib.Path.glob(**) because the former follows symbolic links to
directories in recursive mode (including "infinite" loops), whereas the
latter (and more modern) doesn't. While trying to stick to this change,
testing found a significant number of other issues and addressing all of
them unfortunately turned into this significant rewrite. The entire
Codeowners check is too small to isolate every change in separate
commit, sorry for the resulting "bulk" review work load (~100 lines)

Here's a simplified and extreme reproduction:
- run a fairly extensive sanitycheck run. This creates 100,000s of files
  in sanity-out (thank you cmake)
- observe how check_compliance.py -m Codeowners now takes minutes
- create the following symbolic link: ln -s ../zephyr myself
- observe how check_compliance.py -m Codeowners now takes hours or more.

Only "hours" and not infinity because operating systems have limits to
break infinite loops of symbolic links.

Now considering Zephyr supports Windows and none of the multiple flavors
of symbolic links offered by Windows is usable, it's unlikely Zephyr
will ever _require_ symbolic links. However this shouldn't _forbid_
other users to use symbolic links privately.

Funny enough neither glob nor pathlib documents whether they follow
symbolic links to directories in recursive mode. "glob" does and
"pathlib" doesn't. So I thought this was bug in glob and even had a
tentative fix for it. Then I scanned the glob's git log and found it's
actually a "feature":
python/cpython#12918. glob's own documentation
mentions pathlib as a higher-level and newer alternative: let's switch
to pathlib.

Other changes:

- Fixed failing "path not found" detection of bogus lines in
  CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**',
  recursive=True) returns 'doesntexit/' instead of '' (easter egg: there
  is one bogus line right now)

- Successfully tested on Windows. Requires hack replacing sh with pbs
  and disabling a couple unrelated imports though. Previous glob version
  failed to match, new files were never owned.

- Remove (accidental?) limitation to run from the git top level. Moved
  git top-level command to a global and remove the same limitation on -m
  Checkpatch too.

- Tighten the regular expression parsing CODEOWNERS lines to actually
  report invalid lines: make sure pattern starts in first character;
  make sure the owner is not mister whitespace.

Performance:

- Down from 1 sec to less than 0.5s in the most common case: not adding
  any new file. Fix broken optimization that still scanned the whole
  tree even when there are no new files (hint: [''] is not false)

- About 20% faster on large sets without symbolic links (up to
  "infinitely" faster *with* symlinks) One possible reason: the list of
  owned files is now split per CODEOWNERS line for better logging which
  also means scanning smaller lists.

Other minor fixes:

- Remove unused git ls-files invocation, replaced with an optimization
  TODO.
- Use more portable splitlines() instead of split('\n')
- renamed parse_codeowners() to ls_owned_files() because but it does far
  more than parsing CODEOWNERS
- Use "early returns" and extract new function to reduce cyclomatic
  complexity (up to 9 levels of indentation before in
  parse_codeowners(). Other minor simplifications.
- Add some pydocs and logging

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
marc-hb added a commit to marc-hb/ci-tools that referenced this pull request Apr 29, 2019
The initial purpose was to replace glob.glob(**, recursive=True) with
pathlib.Path.glob(**) because the former follows symbolic links to
directories in recursive mode (including "infinite" loops), whereas the
latter (and more modern) doesn't. While trying to stick to this change,
testing found a significant number of other issues and addressing all of
them unfortunately turned into this significant rewrite. The entire
Codeowners check is too small to isolate every change in separate
commit, sorry for the resulting "bulk" review work load (~100 lines)

Here's a simplified and extreme reproduction:
- run a fairly extensive sanitycheck run. This creates 100,000s of files
  in sanity-out (thank you cmake)
- observe how check_compliance.py -m Codeowners now takes minutes
- create the following symbolic link: ln -s ../zephyr myself
- observe how check_compliance.py -m Codeowners now takes hours or more.

Only "hours" and not infinity because operating systems have limits to
break infinite loops of symbolic links.

Now considering Zephyr supports Windows and none of the multiple flavors
of symbolic links offered by Windows is usable, it's unlikely Zephyr
will ever _require_ symbolic links. However this shouldn't _forbid_
other users to use symbolic links privately.

Funny enough neither glob nor pathlib documents whether they follow
symbolic links to directories in recursive mode. "glob" does and
"pathlib" doesn't. So I thought this was bug in glob and even had a
tentative fix for it. Then I scanned the glob's git log and found it's
actually a "feature":
python/cpython#12918. glob's own documentation
mentions pathlib as a higher-level and newer alternative: let's switch
to pathlib.

Other changes:

- Fixed failing "path not found" detection of bogus lines in
  CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**',
  recursive=True) returns 'doesntexit/' instead of '' (easter egg: there
  is one bogus line right now)

- Successfully tested on Windows. Requires hack replacing sh with pbs
  and disabling a couple unrelated imports though. Previous glob version
  failed to match, new files were never owned.

- Remove (accidental?) limitation to run from the git top level. Moved
  git top-level command to a global and remove the same limitation on -m
  Checkpatch too.

- Tighten the regular expression parsing CODEOWNERS lines to actually
  report invalid lines: make sure pattern starts in first character;
  make sure the owner is not mister whitespace.

Performance:

- Down from 1 sec to less than 0.5s in the most common case: not adding
  any new file. Fix broken optimization that still scanned the whole
  tree even when there are no new files (hint: [''] is not false)

- About 20% faster on large sets without symbolic links (up to
  "infinitely" faster *with* symlinks) One possible reason: the list of
  owned files is now split per CODEOWNERS line for better logging which
  also means scanning smaller lists.

Other minor fixes:

- Remove unused git ls-files invocation, replaced with an optimization
  TODO.
- Use more portable splitlines() instead of split('\n')
- renamed parse_codeowners() to ls_owned_files() because but it does far
  more than parsing CODEOWNERS
- Use "early returns" and extract new function to reduce cyclomatic
  complexity (up to 9 levels of indentation before in
  parse_codeowners(). Other minor simplifications.
- Add some pydocs and logging

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
nashif pushed a commit to zephyrproject-rtos/ci-tools that referenced this pull request Apr 30, 2019
The initial purpose was to replace glob.glob(**, recursive=True) with
pathlib.Path.glob(**) because the former follows symbolic links to
directories in recursive mode (including "infinite" loops), whereas the
latter (and more modern) doesn't. While trying to stick to this change,
testing found a significant number of other issues and addressing all of
them unfortunately turned into this significant rewrite. The entire
Codeowners check is too small to isolate every change in separate
commit, sorry for the resulting "bulk" review work load (~100 lines)

Here's a simplified and extreme reproduction:
- run a fairly extensive sanitycheck run. This creates 100,000s of files
  in sanity-out (thank you cmake)
- observe how check_compliance.py -m Codeowners now takes minutes
- create the following symbolic link: ln -s ../zephyr myself
- observe how check_compliance.py -m Codeowners now takes hours or more.

Only "hours" and not infinity because operating systems have limits to
break infinite loops of symbolic links.

Now considering Zephyr supports Windows and none of the multiple flavors
of symbolic links offered by Windows is usable, it's unlikely Zephyr
will ever _require_ symbolic links. However this shouldn't _forbid_
other users to use symbolic links privately.

Funny enough neither glob nor pathlib documents whether they follow
symbolic links to directories in recursive mode. "glob" does and
"pathlib" doesn't. So I thought this was bug in glob and even had a
tentative fix for it. Then I scanned the glob's git log and found it's
actually a "feature":
python/cpython#12918. glob's own documentation
mentions pathlib as a higher-level and newer alternative: let's switch
to pathlib.

Other changes:

- Fixed failing "path not found" detection of bogus lines in
  CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**',
  recursive=True) returns 'doesntexit/' instead of '' (easter egg: there
  is one bogus line right now)

- Successfully tested on Windows. Requires hack replacing sh with pbs
  and disabling a couple unrelated imports though. Previous glob version
  failed to match, new files were never owned.

- Remove (accidental?) limitation to run from the git top level. Moved
  git top-level command to a global and remove the same limitation on -m
  Checkpatch too.

- Tighten the regular expression parsing CODEOWNERS lines to actually
  report invalid lines: make sure pattern starts in first character;
  make sure the owner is not mister whitespace.

Performance:

- Down from 1 sec to less than 0.5s in the most common case: not adding
  any new file. Fix broken optimization that still scanned the whole
  tree even when there are no new files (hint: [''] is not false)

- About 20% faster on large sets without symbolic links (up to
  "infinitely" faster *with* symlinks) One possible reason: the list of
  owned files is now split per CODEOWNERS line for better logging which
  also means scanning smaller lists.

Other minor fixes:

- Remove unused git ls-files invocation, replaced with an optimization
  TODO.
- Use more portable splitlines() instead of split('\n')
- renamed parse_codeowners() to ls_owned_files() because but it does far
  more than parsing CODEOWNERS
- Use "early returns" and extract new function to reduce cyclomatic
  complexity (up to 9 levels of indentation before in
  parse_codeowners(). Other minor simplifications.
- Add some pydocs and logging

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@serhiy-storchaka
Copy link
Member

It is technically correct, but since my English is bad I'm not sure about the best phrasing.

ulfalizer pushed a commit to ulfalizer/zephyr that referenced this pull request May 13, 2019
The initial purpose was to replace glob.glob(**, recursive=True) with
pathlib.Path.glob(**) because the former follows symbolic links to
directories in recursive mode (including "infinite" loops), whereas the
latter (and more modern) doesn't. While trying to stick to this change,
testing found a significant number of other issues and addressing all of
them unfortunately turned into this significant rewrite. The entire
Codeowners check is too small to isolate every change in separate
commit, sorry for the resulting "bulk" review work load (~100 lines)

Here's a simplified and extreme reproduction:
- run a fairly extensive sanitycheck run. This creates 100,000s of files
  in sanity-out (thank you cmake)
- observe how check_compliance.py -m Codeowners now takes minutes
- create the following symbolic link: ln -s ../zephyr myself
- observe how check_compliance.py -m Codeowners now takes hours or more.

Only "hours" and not infinity because operating systems have limits to
break infinite loops of symbolic links.

Now considering Zephyr supports Windows and none of the multiple flavors
of symbolic links offered by Windows is usable, it's unlikely Zephyr
will ever _require_ symbolic links. However this shouldn't _forbid_
other users to use symbolic links privately.

Funny enough neither glob nor pathlib documents whether they follow
symbolic links to directories in recursive mode. "glob" does and
"pathlib" doesn't. So I thought this was bug in glob and even had a
tentative fix for it. Then I scanned the glob's git log and found it's
actually a "feature":
python/cpython#12918. glob's own documentation
mentions pathlib as a higher-level and newer alternative: let's switch
to pathlib.

Other changes:

- Fixed failing "path not found" detection of bogus lines in
  CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**',
  recursive=True) returns 'doesntexit/' instead of '' (easter egg: there
  is one bogus line right now)

- Successfully tested on Windows. Requires hack replacing sh with pbs
  and disabling a couple unrelated imports though. Previous glob version
  failed to match, new files were never owned.

- Remove (accidental?) limitation to run from the git top level. Moved
  git top-level command to a global and remove the same limitation on -m
  Checkpatch too.

- Tighten the regular expression parsing CODEOWNERS lines to actually
  report invalid lines: make sure pattern starts in first character;
  make sure the owner is not mister whitespace.

Performance:

- Down from 1 sec to less than 0.5s in the most common case: not adding
  any new file. Fix broken optimization that still scanned the whole
  tree even when there are no new files (hint: [''] is not false)

- About 20% faster on large sets without symbolic links (up to
  "infinitely" faster *with* symlinks) One possible reason: the list of
  owned files is now split per CODEOWNERS line for better logging which
  also means scanning smaller lists.

Other minor fixes:

- Remove unused git ls-files invocation, replaced with an optimization
  TODO.
- Use more portable splitlines() instead of split('\n')
- renamed parse_codeowners() to ls_owned_files() because but it does far
  more than parsing CODEOWNERS
- Use "early returns" and extract new function to reduce cyclomatic
  complexity (up to 9 levels of indentation before in
  parse_codeowners(). Other minor simplifications.
- Add some pydocs and logging

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-h38
Copy link
Contributor Author

marc-h38 commented Jun 7, 2019

@vstinner , @ncoghlan , @bitdancer, could you review this one-line doc change or help find someone who can? Just picking up semi-random names from the git log and from https://bugs.python.org/issue13968, sorry.

@ncoghlan
Copy link
Contributor

ncoghlan commented Jun 9, 2019

My main question would be with the second half of the sentence: I think we follow symlinks regardless, but the revised wording makes it sound like appending / turns off the symlink following as well.

(I suspect there's another error in that wording as well: when os.altsep is defined, appending it should have the same effect as appending os.sep)

@marc-h38
Copy link
Contributor Author

marc-h38 commented Jun 19, 2019

but the revised wording makes it sound like appending / turns off the symlink following as well.

Good catch, thanks for the review. How about something like this then:

If *recursive* is true, the pattern ** will match any files and zero or
more directories, subdirectories and symbolic links to directories. If the
pattern is followed by an ``os.sep`` [or os.altsep] then files won't match.

I will test os.altsep. UPDATE: a trailing os.altsep has the same effect as a trailing os.sep. Will mention both.

@ncoghlan
Copy link
Contributor

@marc-h38 Aye, that revised wording looks good to me.

The existing note does already warn that "Using the ** pattern in large
directory trees may consume an inordinate amount of time", however
following symbolic links to directories brings that to entirely new
levels; learned the hard way.

I initially assumed this was a bug, tentatively fixed with a simple:

-   if not dironly or entry.is_dir():
+   if not dironly or entry.is_dir(follow_symlinks=False):

... in glob.py#_iterdir(dirname, dironly):

Later I found https://bugs.python.org/issue13968 in the git log and saw
this is working as intended with even a symlink loop test.
@marc-h38 marc-h38 force-pushed the doc-glob-recursive-symlinks branch from 070276d to fd5babb Compare July 8, 2019 04:42
@marc-h38
Copy link
Contributor Author

@ncoghlan , others: friendly ping? Should be a no-brainer now.

@serhiy-storchaka
Copy link
Member

Symlinks to files will not match as well. But symlinks to directories will match.

Copy link
Contributor

@aeros aeros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @marc-h38!

The proposed changes look good to me, but I would recommend some minor formatting adjustments:

Copy link
Contributor Author

@marc-h38 marc-h38 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symlinks to files will not match either (when there's a trailing '/')

The sole purpose of this commit is to highlight that recursive glob ** treats symlinks to directories like directories. That's because symlinks to directories are relatively "evil": for instance they can easily cause infinite loops. Learned the hard way. For comparison, pathlib does not recurse into symbolic links to directories.

On the other hand, there is nowhere near as much evil in symbolic link to files.

The current documentation does not make any reference to symbolic link to files, should I introduce this for the first time?

- ``os.sep`` or ``os.altsep``
+ :data:`os.sep` or :data:`os.altsep`

Co-Authored-By: Kyle Stanley <aeros167@gmail.com>
@marc-h38
Copy link
Contributor Author

marc-h38 commented Sep 3, 2019

@serhiy-storchaka do you think this should be expanded even more and should mention symbolic links to files too?

(The risk being of course that no correction at all ever happens due to loss of interest from all parties ;-)

@JulienPalard JulienPalard merged commit e24594b into python:master Sep 11, 2019
@miss-islington
Copy link
Contributor

Thanks @marc-h38 for the PR, and @JulienPalard for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7, 3.8.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Sorry @marc-h38 and @JulienPalard, I had trouble checking out the 3.8 backport branch.
Please backport using cherry_picker on command line.
cherry_picker e24594bfe75aff3e654665cb940ddc4d4acffd2f 3.8

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 11, 2019
(cherry picked from commit e24594b)

Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
@bedevere-bot
Copy link

GH-15995 is a backport of this pull request to the 3.7 branch.

@JulienPalard
Copy link
Member

@marc-h38 thanks for enhancing the documentation, the current state of it looks good to me. Telling further about glob not "following symlink files" would probably just add too much and make the paragraph less easy to understand. Thanks!

@miss-islington
Copy link
Contributor

Thanks @marc-h38 for the PR, and @JulienPalard for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Sep 11, 2019
(cherry picked from commit e24594b)

Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
@bedevere-bot
Copy link

GH-15996 is a backport of this pull request to the 3.8 branch.

miss-islington added a commit that referenced this pull request Sep 11, 2019
(cherry picked from commit e24594b)

Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
miss-islington added a commit that referenced this pull request Sep 11, 2019
(cherry picked from commit e24594b)

Co-authored-by: Marc <Marc.Herbert+github@gmail.com>
DinoV pushed a commit to DinoV/cpython that referenced this pull request Sep 12, 2019
ulfalizer pushed a commit to ulfalizer/zephyr that referenced this pull request Oct 23, 2019
The initial purpose was to replace glob.glob(**, recursive=True) with
pathlib.Path.glob(**) because the former follows symbolic links to
directories in recursive mode (including "infinite" loops), whereas the
latter (and more modern) doesn't. While trying to stick to this change,
testing found a significant number of other issues and addressing all of
them unfortunately turned into this significant rewrite. The entire
Codeowners check is too small to isolate every change in separate
commit, sorry for the resulting "bulk" review work load (~100 lines)

Here's a simplified and extreme reproduction:
- run a fairly extensive sanitycheck run. This creates 100,000s of files
  in sanity-out (thank you cmake)
- observe how check_compliance.py -m Codeowners now takes minutes
- create the following symbolic link: ln -s ../zephyr myself
- observe how check_compliance.py -m Codeowners now takes hours or more.

Only "hours" and not infinity because operating systems have limits to
break infinite loops of symbolic links.

Now considering Zephyr supports Windows and none of the multiple flavors
of symbolic links offered by Windows is usable, it's unlikely Zephyr
will ever _require_ symbolic links. However this shouldn't _forbid_
other users to use symbolic links privately.

Funny enough neither glob nor pathlib documents whether they follow
symbolic links to directories in recursive mode. "glob" does and
"pathlib" doesn't. So I thought this was bug in glob and even had a
tentative fix for it. Then I scanned the glob's git log and found it's
actually a "feature":
python/cpython#12918. glob's own documentation
mentions pathlib as a higher-level and newer alternative: let's switch
to pathlib.

Other changes:

- Fixed failing "path not found" detection of bogus lines in
  CODEOWNERS. glob seems to have a bug where glob.glob('doesntexist/**',
  recursive=True) returns 'doesntexit/' instead of '' (easter egg: there
  is one bogus line right now)

- Successfully tested on Windows. Requires hack replacing sh with pbs
  and disabling a couple unrelated imports though. Previous glob version
  failed to match, new files were never owned.

- Remove (accidental?) limitation to run from the git top level. Moved
  git top-level command to a global and remove the same limitation on -m
  Checkpatch too.

- Tighten the regular expression parsing CODEOWNERS lines to actually
  report invalid lines: make sure pattern starts in first character;
  make sure the owner is not mister whitespace.

Performance:

- Down from 1 sec to less than 0.5s in the most common case: not adding
  any new file. Fix broken optimization that still scanned the whole
  tree even when there are no new files (hint: [''] is not false)

- About 20% faster on large sets without symbolic links (up to
  "infinitely" faster *with* symlinks) One possible reason: the list of
  owned files is now split per CODEOWNERS line for better logging which
  also means scanning smaller lists.

Other minor fixes:

- Remove unused git ls-files invocation, replaced with an optimization
  TODO.
- Use more portable splitlines() instead of split('\n')
- renamed parse_codeowners() to ls_owned_files() because but it does far
  more than parsing CODEOWNERS
- Use "early returns" and extract new function to reduce cyclomatic
  complexity (up to 9 levels of indentation before in
  parse_codeowners(). Other minor simplifications.
- Add some pydocs and logging

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir skip issue skip news
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants