scantree-0.0.1/0000755000076500000240000000000013447627322014630 5ustar andershussstaff00000000000000scantree-0.0.1/LICENSE0000644000076500000240000000205413441647066015637 0ustar andershussstaff00000000000000MIT License Copyright (c) 2019 Anders Huss Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
scantree-0.0.1/MANIFEST.in0000644000076500000240000000003213441647066016362 0ustar andershussstaff00000000000000include README.md LICENSE scantree-0.0.1/PKG-INFO0000644000076500000240000001240713447627322015731 0ustar andershussstaff00000000000000Metadata-Version: 2.1 Name: scantree Version: 0.0.1 Summary: Flexible recursive directory iterator: scandir meets glob("**", recursive=True) Home-page: https://github.com/andhus/scantree Author: Anders Huss Author-email: andhus@kth.se License: MIT Description: [![Build Status](https://travis-ci.com/andhus/scantree.svg?branch=master)](https://travis-ci.com/andhus/scantree) [![codecov](https://codecov.io/gh/andhus/scantree/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/scantree) # scantree Recursive directory iterator supporting: - flexible filtering including wildcard path matching - in memory representation of file-tree (for repeated access) - efficient access to directory entry properties (`posix.DirEntry` interface) extended with real path and path relative to the recursion root directory - detection and handling of cyclic symlinks ## Installation ```commandline pip install scantree ``` ## Usage See source code for full documentation, some generic examples below. 
Get matching file paths: ```python from scantree import scantree, RecursionFilter tree = scantree('/path/to/dir', RecursionFilter(match=['*.txt'])) print([path.relative for path in tree.filepaths()]) print([path.real for path in tree.filepaths()]) ``` ``` ['d1/d2/file3.txt', 'd1/file2.txt', 'file1.txt'] ['/path/to/other_dir/file3.txt', '/path/to/dir/d1/file2.txt', '/path/to/dir/file1.txt'] ``` Access metadata of directory entries in file tree: ```python d2 = tree.directories[0].directories[0] print(type(d2)) print(d2.path.absolute) print(d2.path.real) print(d2.path.is_symlink()) print(d2.files[0].relative) ``` ``` scantree._node.DirNode /path/to/dir/d1/d2 /path/to/other_dir True d1/d2/file3.txt ``` Aggregate information by operating on tree: ```python hello_count = tree.apply( file_apply=lambda path: sum([ w.lower() == 'hello' for w in path.as_pathlib().read_text().split() ]), dir_apply=lambda dir_: sum(dir_.entries), ) print(hello_count) ``` ``` 3 ``` ```python hello_count_tree = tree.apply( file_apply=lambda path: { 'name': path.name, 'count': sum([ w.lower() == 'hello' for w in path.as_pathlib().read_text().split() ]) }, dir_apply=lambda dir_: { 'name': dir_.path.name, 'count': sum(e['count'] for e in dir_.entries), 'sub_counts': [e for e in dir_.entries] }, ) from pprint import pprint pprint(hello_count_tree) ``` ``` {'count': 3, 'name': 'dir', 'sub_counts': [{'count': 2, 'name': 'file1.txt'}, {'count': 1, 'name': 'd1', 'sub_counts': [{'count': 1, 'name': 'file2.txt'}, {'count': 0, 'name': 'd2', 'sub_counts': [{'count': 0, 'name': 'file3.txt'}]}]}]} ``` Flexible filtering: ```python without_hidden_files = scantree('.', RecursionFilter(match=['*', '!.*'])) without_palindrome_linked_dirs = scantree( '.', lambda paths: [ p for p in paths if not ( p.is_dir() and p.is_symlink() and p.name == p.name[::-1] ) ] ) ``` Comparison: ```python tree = scantree('path/to/dir') # make some operations on filesystem, make sure file tree is the same: assert tree == 
scantree('path/to/dir') # tree contains absolute/real path info: import shutil shutil.copytree('path/to/dir', 'path/to/other_dir') new_tree = scantree('path/to/other_dir') assert tree != new_tree assert ( [p.relative for p in tree.leafpaths()] == [p.relative for p in new_tree.leafpaths()] ) ``` Inspect symlinks: ```python from scantree import CyclicLinkedDir file_links = [] dir_links = [] cyclic_links = [] def file_apply(path): if path.is_symlink(): file_links.append(path) def dir_apply(dir_node): if dir_node.path.is_symlink(): dir_links.append(dir_node.path) if isinstance(dir_node, CyclicLinkedDir): cyclic_links.append((dir_node.path, dir_node.target_path)) scantree('.', file_apply=file_apply, dir_apply=dir_apply) ``` Platform: UNKNOWN Description-Content-Type: text/markdown scantree-0.0.1/README.md0000644000076500000240000000746613447626666016136 0ustar andershussstaff00000000000000[![Build Status](https://travis-ci.com/andhus/scantree.svg?branch=master)](https://travis-ci.com/andhus/scantree) [![codecov](https://codecov.io/gh/andhus/scantree/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/scantree) # scantree Recursive directory iterator supporting: - flexible filtering including wildcard path matching - in memory representation of file-tree (for repeated access) - efficient access to directory entry properties (`posix.DirEntry` interface) extended with real path and path relative to the recursion root directory - detection and handling of cyclic symlinks ## Installation ```commandline pip install scantree ``` ## Usage See source code for full documentation, some generic examples below. 
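As background for the last feature in the list above (detection of cyclic symlinks), here is a minimal standard-library sketch of the idea. It is not scantree's implementation; `find_cyclic_links` and `demo` are hypothetical names introduced only for illustration. It assumes a platform where `os.symlink` is available and treats a directory symlink as cyclic when its real path has already been visited on the current branch of the recursion.

```python
import os
import tempfile


def find_cyclic_links(root):
    """Return relative paths of directory symlinks whose target was already
    visited on the current branch of the recursion (illustrative sketch)."""
    cyclic = []

    def walk(path, relative, parents):
        real = os.path.realpath(path)
        if os.path.islink(path) and real in parents:
            # Following this link would revisit a parent directory: stop here.
            cyclic.append(relative)
            return
        # A new set per branch, so sibling branches do not affect each other.
        parents = parents | {real}
        for entry in sorted(os.scandir(path), key=lambda e: e.name):
            if entry.is_dir():  # follows symlinks by default
                walk(entry.path, os.path.join(relative, entry.name), parents)

    walk(root, '', frozenset())
    return sorted(cyclic)


def demo():
    """Build root/A/{B,C} where each has a link `toA -> ..` and scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.join(tmp, 'root')
        os.makedirs(os.path.join(root, 'A', 'B'))
        os.makedirs(os.path.join(root, 'A', 'C'))
        os.symlink('..', os.path.join(root, 'A', 'B', 'toA'))
        os.symlink('..', os.path.join(root, 'A', 'C', 'toA'))
        return find_cyclic_links(root)
```

Calling `demo()` reports the two links `A/B/toA` and `A/C/toA`, since both point back to `A`, which is a parent on their own branch.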
Get matching file paths: ```python from scantree import scantree, RecursionFilter tree = scantree('/path/to/dir', RecursionFilter(match=['*.txt'])) print([path.relative for path in tree.filepaths()]) print([path.real for path in tree.filepaths()]) ``` ``` ['d1/d2/file3.txt', 'd1/file2.txt', 'file1.txt'] ['/path/to/other_dir/file3.txt', '/path/to/dir/d1/file2.txt', '/path/to/dir/file1.txt'] ``` Access metadata of directory entries in file tree: ```python d2 = tree.directories[0].directories[0] print(type(d2)) print(d2.path.absolute) print(d2.path.real) print(d2.path.is_symlink()) print(d2.files[0].relative) ``` ``` scantree._node.DirNode /path/to/dir/d1/d2 /path/to/other_dir True d1/d2/file3.txt ``` Aggregate information by operating on tree: ```python hello_count = tree.apply( file_apply=lambda path: sum([ w.lower() == 'hello' for w in path.as_pathlib().read_text().split() ]), dir_apply=lambda dir_: sum(dir_.entries), ) print(hello_count) ``` ``` 3 ``` ```python hello_count_tree = tree.apply( file_apply=lambda path: { 'name': path.name, 'count': sum([ w.lower() == 'hello' for w in path.as_pathlib().read_text().split() ]) }, dir_apply=lambda dir_: { 'name': dir_.path.name, 'count': sum(e['count'] for e in dir_.entries), 'sub_counts': [e for e in dir_.entries] }, ) from pprint import pprint pprint(hello_count_tree) ``` ``` {'count': 3, 'name': 'dir', 'sub_counts': [{'count': 2, 'name': 'file1.txt'}, {'count': 1, 'name': 'd1', 'sub_counts': [{'count': 1, 'name': 'file2.txt'}, {'count': 0, 'name': 'd2', 'sub_counts': [{'count': 0, 'name': 'file3.txt'}]}]}]} ``` Flexible filtering: ```python without_hidden_files = scantree('.', RecursionFilter(match=['*', '!.*'])) without_palindrome_linked_dirs = scantree( '.', lambda paths: [ p for p in paths if not ( p.is_dir() and p.is_symlink() and p.name == p.name[::-1] ) ] ) ``` Comparison: ```python tree = scantree('path/to/dir') # make some operations on filesystem, make sure file tree is the same: assert tree == 
scantree('path/to/dir') # tree contains absolute/real path info: import shutil shutil.copytree('path/to/dir', 'path/to/other_dir') new_tree = scantree('path/to/other_dir') assert tree != new_tree assert ( [p.relative for p in tree.leafpaths()] == [p.relative for p in new_tree.leafpaths()] ) ``` Inspect symlinks: ```python from scantree import CyclicLinkedDir file_links = [] dir_links = [] cyclic_links = [] def file_apply(path): if path.is_symlink(): file_links.append(path) def dir_apply(dir_node): if dir_node.path.is_symlink(): dir_links.append(dir_node.path) if isinstance(dir_node, CyclicLinkedDir): cyclic_links.append((dir_node.path, dir_node.target_path)) scantree('.', file_apply=file_apply, dir_apply=dir_apply) ```scantree-0.0.1/setup.cfg0000644000076500000240000000004613447627322016451 0ustar andershussstaff00000000000000[egg_info] tag_build = tag_date = 0 scantree-0.0.1/setup.py0000644000076500000240000000207213447625613016344 0ustar andershussstaff00000000000000import io import os from setuptools import setup, find_packages VERSION = '0.0.1' DESCRIPTION = ( 'Flexible recursive directory iterator: scandir meets glob("**", recursive=True)' ) PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__)) try: with io.open(os.path.join(PROJECT_ROOT, 'README.md'), encoding='utf-8') as f: long_description = '\n' + f.read() except IOError: long_description = DESCRIPTION setup( name='scantree', version=VERSION, description=DESCRIPTION, long_description=long_description, long_description_content_type="text/markdown", url='https://github.com/andhus/scantree', author="Anders Huss", author_email="andhus@kth.se", license='MIT', install_requires=[ 'attrs>=18.0.0', 'six>=1.12.0', 'scandir>=1.9.0;python_version<"3.5"', 'pathlib2>=2.3.3;python_version<"3.4"', 'pathspec>=0.5.9' ], packages=find_packages('src'), package_dir={'': 'src'}, include_package_data=True, entry_points={}, tests_require=['pytest', 'pytest-cov'] ) 
scantree-0.0.1/src/0000755000076500000240000000000013447627322015417 5ustar andershussstaff00000000000000scantree-0.0.1/src/scantree/0000755000076500000240000000000013447627322017223 5ustar andershussstaff00000000000000scantree-0.0.1/src/scantree/__init__.py0000644000076500000240000000043613447625613021340 0ustar andershussstaff00000000000000from __future__ import print_function, division from ._path import ( RecursionPath, DirEntryReplacement ) from ._node import ( DirNode, LinkedDir, CyclicLinkedDir ) from ._filter import RecursionFilter from ._scan import ( scantree, SymlinkRecursionError ) scantree-0.0.1/src/scantree/_filter.py0000644000076500000240000000545113447625613021227 0ustar andershussstaff00000000000000from __future__ import print_function, division from pathspec import PathSpec from pathspec.util import normalize_file, match_file from pathspec.patterns import GitWildMatchPattern class RecursionFilter(object): """Callable object for filtering of sequence of `RecursionPath`:s. Intended for use as `recursion_filter` argument in `scantree`. # Arguments: linked_dirs (bool): Whether to include linked directories. Default True. linked_files (bool): Whether to include linked files. Default True. match ([str] | None): List of gitignore-style wildcard match patterns. The `RecursionPath.relative` path must match at least one of the patterns not starting with `'!'` and none of the patterns starting with `'!'`. Matching is done based on the `pathspec` library implementation (https://github.com/cpburnz/python-path-specification). Default `None` which is equivalent to ['*'] matching all file paths. 
""" def __init__( self, linked_dirs=True, linked_files=True, match=None, ): self.linked_dirs = linked_dirs self.linked_files = linked_files self._match_patterns = tuple('*') if match is None else tuple(match) if self._match_patterns != tuple('*'): self._path_spec = PathSpec.from_lines( GitWildMatchPattern, self.match_patterns ) else: self._path_spec = None @property def match_patterns(self): return self._match_patterns def include(self, recursion_path): if recursion_path.is_symlink(): if recursion_path.is_dir() and not self.linked_dirs: return False if recursion_path.is_file() and not self.linked_files: return False if recursion_path.is_dir(): # only filepaths are matched against patterns return True return self.match_file(recursion_path.relative) def match_file(self, filepath): """Match file against match patterns. NOTE: only match patterns are considered, not the `linked_files` argument of this class. # Arguments: filepath (str): the path to match. # Returns: Boolean, whether the path is a match or not. """ if self._path_spec is None: return True return match_file(self._path_spec.patterns, normalize_file(filepath)) def __call__(self, paths): """Filter recursion paths. # Arguments: paths ([RecursionPath]): The recursion paths to filter. # Returns: A generator of (filtered) recursion paths. """ for path in paths: if self.include(path): yield path scantree-0.0.1/src/scantree/_node.py0000644000076500000240000001606613447625613020673 0ustar andershussstaff00000000000000from __future__ import print_function, division import attr from ._path import RecursionPath @attr.s(slots=True, frozen=True) class DirNode(object): """A directory node in a Directed Acyclic Graph (DAG) representing a file system tree. NOTE: this class is normally only ever instantiated by the `scantree` function. # Arguments: path (RecursionPath): The recursion path to the directory. directories ([object]): The result of `scantree` `dir_apply` argument applied to the subdirectories of this directory. 
files ([object]): The result of `scantree` `file_apply` argument applied to the files of this directory. """ path = attr.ib(validator=attr.validators.instance_of(RecursionPath)) files = attr.ib(default=tuple(), converter=tuple) directories = attr.ib(default=tuple(), converter=tuple) @property def empty(self): """Boolean: whether this directory node has no files and no subdirectories.""" return not (self.files or self.directories) @property def entries(self): """Tuple of files followed by directories.""" return self.files + self.directories def apply(self, dir_apply, file_apply): """Operate on the file tree under this directory node recursively. # Arguments: file_apply (f: f(object) -> object): The function to apply to each file. dir_apply (f: f(DirNode) -> object): The function to apply to the `DirNode` for each (sub) directory. # Returns: The `object` returned by `dir_apply` on this `DirNode` after recursive application of `file_apply` and `dir_apply` on its subdirectories and files. """ dir_node = DirNode( path=self.path, directories=[dir_.apply(dir_apply, file_apply) for dir_ in self.directories], files=[file_apply(file_) for file_ in self.files] ) return dir_apply(dir_node) def leafpaths(self): """Get the leaves of the file tree under this directory node. # Returns: A list of `RecursionPaths` sorted on relative path. If the tree contains empty directories, `LinkedDir` or `CyclicLinkedDir` nodes these will be included. If none of these are present (which is the case for the result of `scantree('.', include_empty=False, follow_links=True, allow_cyclic_links=False)`) this method will only return paths to the files, i.e. the same as the `filepaths` method. NOTE: `LinkedDir` and `CyclicLinkedDir` nodes are considered leaves since they are leaves in the actual DAG data structure, even though they are not necessarily leaves in terms of the underlying file-system structure that they represent. 
""" leafs = [] def file_apply(path): leafs.append(path) def dir_apply(dir_node): if isinstance(dir_node, (LinkedDir, CyclicLinkedDir)) or dir_node.empty: leafs.append(dir_node.path) self.apply(dir_apply=dir_apply, file_apply=file_apply) return sorted(leafs, key=lambda path: path.relative) def filepaths(self): """Get the filepaths of the file tree under this directory. # Returns: A list of `RecursionPaths` sorted on relative path. """ files = [] def file_apply(path): files.append(path) self.apply(dir_apply=identity, file_apply=file_apply) return sorted(files, key=lambda path: path.relative) @attr.s(slots=True, frozen=True) class LinkedDir(object): """This node represents a symbolic link to a directory. It is created by `scantree` to represent a linked directory when the argument `follow_links` is set to `False`. NOTE: this class is normally only ever instantiated by the `scantree` function. # Arguments: path (RecursionPath): The recursion path to the *link* to a directory. """ path = attr.ib(validator=attr.validators.instance_of(RecursionPath)) @property def directories(self): raise AttributeError( '`directories` is undefined for `LinkedDir` nodes. Use e.g. ' '`[de for de in scandir(linked_dir.path.real) if de.is_dir()]` ' 'to get a list of the sub directories of the linked directory' ) @property def files(self): raise AttributeError( '`files` is undefined for `LinkedDir` nodes. Use e.g. ' '`[de for de in scandir(linked_dir.path.real) if de.is_file()]` ' 'to get a list of the files of the linked directory' ) @property def entries(self): raise AttributeError( '`entries` is undefined for `LinkedDir` nodes. Use e.g. 
' '`scandir(linked_dir.path.real)` to get the entries of the linked ' 'directory' ) @property def empty(self): raise AttributeError('`empty` is undefined for `LinkedDir` nodes.') def apply(self, dir_apply, file_apply=None): return dir_apply(self) @attr.s(slots=True, frozen=True) class CyclicLinkedDir(object): """This node represents a symbolic link causing a cycle of symlinks. It is created by `scantree` to represent a cyclic link when the argument `allow_cyclic_links` is set to `True`. NOTE: this class is normally only ever instantiated by the `scantree` function. # Arguments: path (RecursionPath): The recursion path to the *symlink* to a directory (which is a parent of this directory). target_path (RecursionPath): The recursion path to the target directory of the link (which is a parent of this directory). """ path = attr.ib(validator=attr.validators.instance_of(RecursionPath)) target_path = attr.ib(validator=attr.validators.instance_of(RecursionPath)) @property def directories(self): raise AttributeError( '`directories` is undefined for `CyclicLinkedDir` to avoid infinite ' 'recursion. `target_path` property contains the `RecursionPath` for the ' 'target directory.' ) @property def files(self): raise AttributeError( '`files` is undefined for `CyclicLinkedDir` to avoid infinite ' 'recursion. `target_path` property contains the `RecursionPath` for the ' 'target directory.' ) @property def entries(self): raise AttributeError( '`entries` is undefined for `CyclicLinkedDir` to avoid infinite ' 'recursion. `target_path` property contains the `RecursionPath` for the ' 'target directory.' 
) @property def empty(self): """A cyclic linked dir is never empty.""" return False def apply(self, dir_apply, file_apply=None): return dir_apply(self) def is_empty_dir_node(dir_node): return isinstance(dir_node, DirNode) and dir_node.empty def identity(x): return x scantree-0.0.1/src/scantree/_path.py0000644000076500000240000001562213447625613020677 0ustar andershussstaff00000000000000from __future__ import print_function, division import os import attr from .compat import ( Path, fspath, scandir, DirEntry, ) @attr.s(slots=True) # TODO consider making frozen. class RecursionPath(object): """Caches the properties of directory entries including the path relative to the root directory for recursion. NOTE: this class is normally only ever instantiated by the `scantree` function. The class provides the `DirEntry` interface (found in the external `scandir` module in Python < 3.5 or builtin `posix` module in Python >= 3.5). """ root = attr.ib() relative = attr.ib() real = attr.ib() _dir_entry = attr.ib(cmp=False) @classmethod def from_root(cls, directory): """Instantiate a `RecursionPath` from given directory.""" if isinstance(directory, (DirEntry, DirEntryReplacement)): dir_entry = directory else: dir_entry = DirEntryReplacement.from_path(directory) return cls( root=dir_entry.path, relative='', real=os.path.realpath(dir_entry.path), dir_entry=dir_entry ) def scandir(self): """Scan the underlying directory. # Returns: A generator of `RecursionPath`:s representing the directory entries. """ return (self._join(dir_entry) for dir_entry in scandir(self.absolute)) def _join(self, dir_entry): relative = os.path.join(self.relative, dir_entry.name) real = os.path.join(self.real, dir_entry.name) if dir_entry.is_symlink(): # For large number of files/directories it improves performance # significantly to only call `os.path.realpath` when we are actually # encountering a symlink. 
real = os.path.realpath(real) return attr.evolve(self, relative=relative, real=real, dir_entry=dir_entry) @property def absolute(self): """The absolute path to this entry""" if self.relative == '': return self.root # don't join in this case as that appends trailing '/' return os.path.join(self.root, self.relative) @property def path(self): """The path property according to the `DirEntry` interface. NOTE: this property is only here to fully implement the `DirEntry` interface (which is useful in comparison etc.). It is recommended to use one of (the well defined) `real`, `relative` or `absolute` properties instead. """ return self._dir_entry.path @property def name(self): return self._dir_entry.name def is_dir(self, follow_symlinks=True): return self._dir_entry.is_dir(follow_symlinks=follow_symlinks) def is_file(self, follow_symlinks=True): return self._dir_entry.is_file(follow_symlinks=follow_symlinks) def is_symlink(self): return self._dir_entry.is_symlink() def stat(self, follow_symlinks=True): return self._dir_entry.stat(follow_symlinks=follow_symlinks) def inode(self): return self._dir_entry.inode() def __fspath__(self): return self.absolute def as_pathlib(self): """Get a pathlib version of this path.""" return Path(self.absolute) @staticmethod def _getstate(self): return ( self.root, self.relative, self.real, DirEntryReplacement.from_dir_entry(self._dir_entry) ) @staticmethod def _setstate(self, state): self.root, self.relative, self.real, self._dir_entry = state # Attrs overrides __get/setstate__ for slotted classes, see: # https://github.com/python-attrs/attrs/issues/512 RecursionPath.__getstate__ = RecursionPath._getstate RecursionPath.__setstate__ = RecursionPath._setstate @attr.s(slots=True, cmp=False) class DirEntryReplacement(object): """Pure python implementation of the `DirEntry` interface (found in the external `scandir` module in Python < 3.5 or builtin `posix` module in Python >= 3.5) A `DirEntry` cannot be instantiated directly (only returned from a 
call to `scandir`). This class offers a drop in replacement. Useful in testing and for representing the root directory for `scantree` implementation. """ path = attr.ib(converter=fspath) name = attr.ib() _is_dir = attr.ib(init=False, default=None) _is_file = attr.ib(init=False, default=None) _is_symlink = attr.ib(init=False, default=None) _stat_sym = attr.ib(init=False, default=None) _stat_nosym = attr.ib(init=False, default=None) @classmethod def from_path(cls, path): path = fspath(path) if not os.path.exists(path): raise IOError('{} does not exist'.format(path)) basename = os.path.basename(path) if basename in ['', '.', '..']: name = os.path.basename(os.path.realpath(path)) else: name = basename return cls(path, name) @classmethod def from_dir_entry(cls, dir_entry): return cls(dir_entry.path, dir_entry.name) def is_dir(self, follow_symlinks=True): if self._is_dir is None: self._is_dir = os.path.isdir(self.path) if follow_symlinks: return self._is_dir else: return self._is_dir and not self.is_symlink() def is_file(self, follow_symlinks=True): if self._is_file is None: self._is_file = os.path.isfile(self.path) if follow_symlinks: return self._is_file else: return self._is_file and not self.is_symlink() def is_symlink(self): if self._is_symlink is None: self._is_symlink = os.path.islink(self.path) return self._is_symlink def stat(self, follow_symlinks=True): if follow_symlinks: if self._stat_sym is None: self._stat_sym = os.stat(self.path) return self._stat_sym if self._stat_nosym is None: self._stat_nosym = os.lstat(self.path) return self._stat_nosym def inode(self): return self.stat(follow_symlinks=False).st_ino def __eq__(self, other): if not isinstance(other, (DirEntryReplacement, DirEntry)): return False if not self.path == other.path: return False if not self.name == other.name: return False for method, kwargs in [ ('is_dir', {'follow_symlinks': True}), ('is_dir', {'follow_symlinks': False}), ('is_file', {'follow_symlinks': True}), ('is_file', 
{'follow_symlinks': False}), ('is_symlink', {}), ('stat', {'follow_symlinks': True}), ('stat', {'follow_symlinks': False}), ('inode', {}) ]: this_res = getattr(self, method)(**kwargs) other_res = getattr(other, method)(**kwargs) if not this_res == other_res: return False return True scantree-0.0.1/src/scantree/_scan.py0000644000076500000240000003330113447625613020661 0ustar andershussstaff00000000000000from __future__ import print_function, division import os from multiprocessing.pool import Pool from pathspec import RecursionError as _RecursionError from .compat import fspath from ._node import ( DirNode, LinkedDir, CyclicLinkedDir, identity, is_empty_dir_node ) from ._path import RecursionPath def scantree( directory, recursion_filter=identity, file_apply=identity, dir_apply=identity, follow_links=True, allow_cyclic_links=True, cache_file_apply=False, include_empty=False, jobs=1 ): """Recursively scan the file tree under the given directory. The files and subdirectories in each directory will be used to initialize the object: `DirNode(path=..., files=[...], directories=[...])`, where `path` is the `RecursionPath` to the directory (relative to the root directory of the recursion), `files` is a list of the results of `file_apply` called on the recursion path of each file, and `directories` is a list of the results of `dir_apply` called on each `DirNode` obtained (recursively) for each subdirectory. Hence, with the default value (identity function) for `file_apply` and `dir_apply`, a tree-like data structure is returned representing the file tree of the scanned directory, with all relevant metadata *cached in memory*. 
This example illustrates the core concepts: ``` >>> tree = scantree('/path/to/dir') >>> tree.directories[0].directories[0].path.absolute '/path/to/dir/sub_dir_0/sub_sub_dir_0' >>> tree.directories[0].directories[0].path.relative 'sub_dir_0/sub_sub_dir_0' >>> tree.directories[0].files[0].relative 'sub_dir_0/file_0' >>> tree.directories[0].path.real '/path/to/linked_dir/' >>> tree.directories[0].path.is_symlink() # already cached, no OS call needed True ``` By providing a different `dir_apply` and `file_apply` function, you can operate on the paths and/or data of files while scanning the directory recursively. If `dir_apply` returns some aggregate or nothing (i.e. `None`) the full tree will never be stored in memory. The same result can be obtained by calling `tree.apply(dir_apply=..., file_apply=...)` but this can be done repeatedly without having to rerun expensive OS calls. # Arguments: directory (str | os.PathLike): The directory to scan. recursion_filter (f: f([RecursionPath]) -> [RecursionPath]): A filter function, defining which files to include and which subdirectories to scan, e.g. an instance of `scantree.RecursionFilter`. The `RecursionPath` implements the `DirEntry` interface (found in the external `scandir` module in Python < 3.5 or builtin `posix` module in Python >= 3.5). It caches metadata efficiently and, in addition to DirEntry, provides real path and path relative to the root directory for the recursion as properties, see `scantree.RecursionPath` for further details. file_apply (f: f(RecursionPath) -> object): The function to apply to the `RecursionPath` for each file. Default "identity", i.e. `lambda x: x`. dir_apply (f: f(DirNode) -> object): The function to apply to the `DirNode` for each (sub) directory. Default "identity", i.e. `lambda x: x`. follow_links (bool): Whether to follow symbolic links or not, i.e. to continue the recursive scanning in linked directories. 
If False, linked directories are represented by the `LinkedDir` object which does e.g. not have the `files` and `directories` properties (as these cannot be known without following the link). Default `True`. allow_cyclic_links (bool): If set to `False`, a `SymlinkRecursionError` is raised on detection of cyclic symbolic links, if `True` (default), the cyclic link is represented by a `CyclicLinkedDir` object. See "Cyclic Links Handling" section below for further details. cache_file_apply: If set to `True`, the `file_apply` result will be cached by *real* path. Default `False`. include_empty (bool): If set to `True`, empty directories are included in the result of the recursive scanning, represented by an empty directory node: `DirNode(directories=[], files=[])`. If `False` (default), empty directories are not included in the parent directory node (and subsequently never passed to `dir_apply`). jobs (int | None): If `1` (default), no multiprocessing is used. If jobs > 1, the number of processes to use for parallelizing `file_apply` over included files. If `None`, `os.cpu_count()` number of processes are used. NOTE: if jobs is `None` or > 1, the entire file tree will first be stored in memory before applying `file_apply` and `dir_apply`. # Returns: The `object` returned by `dir_apply` on the `DirNode` for the top level `directory`. If the default value ("identity" function: `lambda x: x`) is used for `dir_apply`, it will be the `DirNode` representing the root node of the file tree. # Raises: SymlinkRecursionError: if `allow_cyclic_links=False` and any cyclic symbolic links are detected. # Cyclic Links Handling: Symbolically linked directories can create cycles in the, otherwise acyclic, graph representing the file tree. If not handled properly, this leads to infinite recursion when traversing the file tree (this is e.g. the case for Python's built-in `os.walk(directory, followlinks=True)`). 
Sometimes multiple links form cycles together, therefore - without loss of generality - cyclic links are defined as: The first occurrence of a link to a directory that has already been visited on the current branch of recursion. With `allow_cyclic_links=True` any link to such a directory is represented by the object `CyclicLinkedDir(path=..., target_path=...)` where `path` is the `RecursionPath` to the link and `target_path` the `RecursionPath` to the parent directory that is the target of the link. In the example below there are cycles on all branches A/B, A/C and D. root/ |__A/ | |__B/ | | |__toA@ -> .. | |__C/ | |__toA@ -> .. |__D/ |__toB@ -> ../A/B In this case, the symlinks with relative paths A/B/toA, A/C/toA and D/toB/toA/B/toA will be represented by a `CyclicLinkedDir` object. Note that for the third branch, the presence of cyclic links can be *detected* already at D/toB/toA/B (since B is already visited) but it is D/toB/toA/B/toA which is considered a cyclic link (and gets represented by a `CyclicLinkedDir`). This reflects the fact that it is the toA that's "causing" the cycle, not D/toB or D/toB/toA/B (which is not even a link), and at D/toB/toA/ the cycle can not yet be detected. Below is another example where multiple links are involved in forming cycles as well as links whose absolute path is external to the root directory for the recursion. In this case the symlinks with relative paths A/toB/toA, B/toA/toB and C/toD/toC are considered cyclic links for `scantree('/path/to/root')`. 
/path/to/root/ |__A/ | |__toB@ -> ../B |__B/ | |__toA@ -> /path/to/root/A |__C/ |__toD@ -> /path/to/D /path/to/D/ |__toC@ -> /path/to/root/C """ _verify_is_directory(directory) if jobs is None or jobs > 1: return _scantree_multiprocess(**vars()) path = RecursionPath.from_root(directory) if cache_file_apply: file_apply = _cached_by_realpath(file_apply) root_dir_node = _scantree_recursive( path=path, recursion_filter=recursion_filter, file_apply=file_apply, dir_apply=dir_apply, follow_links=follow_links, allow_cyclic_links=allow_cyclic_links, include_empty=include_empty, parents={path.real: path}, ) result = dir_apply(root_dir_node) return result def _scantree_multiprocess(**kwargs): """Multiprocess implementation of scantree. Note that it is only the `file_apply` function that is parallelized. """ file_apply = kwargs.pop('file_apply') dir_apply = kwargs.pop('dir_apply') jobs = kwargs.pop('jobs') file_paths = [] def extract_paths(path): result_idx = len(file_paths) file_paths.append(path) return result_idx root_dir_node = scantree(file_apply=extract_paths, dir_apply=identity, **kwargs) pool = Pool(jobs) try: file_results = pool.map(file_apply, file_paths) finally: pool.close() def fetch_result(result_idx): return file_results[result_idx] return root_dir_node.apply(dir_apply=dir_apply, file_apply=fetch_result) def _verify_is_directory(directory): """Verify that `directory` path exists and is a directory, otherwise raise ValueError""" directory = fspath(directory) if not os.path.exists(directory): raise ValueError('{}: No such directory'.format(directory)) if not os.path.isdir(directory): raise ValueError('{}: Is not a directory'.format(directory)) def _cached_by_realpath(file_apply): """Wraps the `file_apply` function with a cache, if `path.real` is already in the cache, the cached value is returned""" cache = {} def file_apply_cached(path): if path.real not in cache: cache[path.real] = file_apply(path) return cache[path.real] return file_apply_cached def 
_scantree_recursive( path, recursion_filter, file_apply, dir_apply, follow_links, allow_cyclic_links, include_empty, parents, ): """The underlying recursive implementation of scantree. # Arguments: path (RecursionPath): the recursion path relative the directory where recursion was initialized. recursion_filter (f: f([RecursionPath]) -> [RecursionPath]): A filter function, defining which files to include and which subdirectories to scan, e.g. an instance of `scantree.RecursionFilter`. file_apply (f: f(RecursionPath) -> object): The function to apply to the `RecursionPath` for each file. Default "identity", i.e. `lambda x: x`. dir_apply (f: f(DirNode) -> object): The function to apply to the `DirNode` for each (sub) directory. Default "identity", i.e. `lambda x: x`. follow_links (bool): Whether to follow symbolic links for not, i.e. to continue the recursive scanning in linked directories. If False, linked directories are represented by the `LinkedDir` object which does e.g. not have the `files` and `directories` properties (as these cannot be known without following the link). Default `True`. allow_cyclic_links (bool): If set to `False`, a `SymlinkRecursionError` is raised on detection of cyclic symbolic links, if `True` (default), the cyclic link is represented by a `CyclicLinkedDir` object. include_empty (bool): If set to `True`, empty directories are included in the result of the recursive scanning, represented by an empty directory node: `DirNode(directories=[], files=[])`. If `False` (default), empty directories are not included in the parent directory node (and subsequently never passed to `dir_apply`). parents ({str: RecursionPath}): Mapping from real path (`str`) to `RecursionPath` of parent directories. # Returns: `DirNode` for the directory at `path`. # Raises: SymlinkRecursionError: if `allow_cyclic_links=False` and any cyclic symbolic links are detected. 
""" fwd_kwargs = vars() del fwd_kwargs['path'] if path.is_symlink(): if not follow_links: return LinkedDir(path) previous_path = parents.get(path.real, None) if previous_path is not None: if allow_cyclic_links: return CyclicLinkedDir(path, previous_path) else: raise SymlinkRecursionError(path, previous_path) if follow_links: parents[path.real] = path dirs = [] files = [] for subpath in sorted(recursion_filter(path.scandir())): if subpath.is_dir(): dir_node = _scantree_recursive(subpath, **fwd_kwargs) if include_empty or not is_empty_dir_node(dir_node): dirs.append(dir_apply(dir_node)) if subpath.is_file(): files.append(file_apply(subpath)) if follow_links: del parents[path.real] return DirNode(path=path, directories=dirs, files=files) class SymlinkRecursionError(_RecursionError): """Raised when symlinks cause a cyclic graph of directories. Extends the `pathspec.util.RecursionError` but with a different name (avoid overriding the built-in error!) and with a more informative string representation (used in `dirhash.cli`). """ def __init__(self, path, target_path): super(SymlinkRecursionError, self).__init__( real_path=path.real, first_path=os.path.join(target_path.root, target_path.relative), second_path=os.path.join(path.root, path.relative) ) def __str__(self): # _RecursionError.__str__ prints args without context return 'Symlink recursion: {}'.format(self.message) scantree-0.0.1/src/scantree/compat.py0000644000076500000240000000157613447625613021072 0ustar andershussstaff00000000000000from __future__ import print_function, division import sys from six import string_types def fspath(path): """In python 2: os.path... 
    and scandir does not support PathLike objects"""
    if isinstance(path, string_types):
        return path
    if hasattr(path, '__fspath__'):
        return path.__fspath__()
    raise TypeError('Object {} is not a path'.format(path))


# Use the built-in version of scandir if possible (python >= 3.5),
# otherwise use the scandir module version
try:
    from os import scandir
    from posix import DirEntry
except ImportError:
    from scandir import scandir as _scandir
    from scandir import DirEntry

    def scandir(path, *args, **kwargs):
        if path is not None:
            path = fspath(path)
        return _scandir(path, *args, **kwargs)

if sys.version_info >= (3, 4):
    from pathlib import Path
else:
    from pathlib2 import Path
scantree-0.0.1/src/scantree/test_utils.py0000644000076500000240000000514213447625613021777 0ustar andershussstaff00000000000000
from __future__ import print_function, division

import os

from ._node import DirNode, LinkedDir, CyclicLinkedDir
from ._path import RecursionPath, DirEntryReplacement


def assert_dir_entry_equal(de1, de2):
    # TODO check has attributes
    assert de1.path == de2.path
    assert de1.name == de2.name
    for method, kwargs in [
        ('is_dir', {'follow_symlinks': True}),
        ('is_dir', {'follow_symlinks': False}),
        ('is_file', {'follow_symlinks': True}),
        ('is_file', {'follow_symlinks': False}),
        ('is_symlink', {}),
        ('stat', {'follow_symlinks': True}),
        ('stat', {'follow_symlinks': False}),
        ('inode', {})
    ]:
        for attempt in [1, 2]:  # done two times to verify caching!
            res1 = getattr(de1, method)(**kwargs)
            res2 = getattr(de2, method)(**kwargs)
            if not res1 == res2:
                raise AssertionError(
                    '\nde1.{method}(**{kwargs}) == {res1} != '
                    '\nde2.{method}(**{kwargs}) == {res2} '
                    '\n(attempt: {attempt})'
                    '\nde1: {de1}'
                    '\nde2: {de2}'.format(
                        method=method,
                        kwargs=kwargs,
                        res1=res1,
                        res2=res2,
                        attempt=attempt,
                        de1=de1,
                        de2=de2
                    )
                )


def assert_recursion_path_equal(p1, p2):
    assert p1.root == p2.root
    assert p1.relative == p2.relative
    assert p1.real == p2.real
    assert p1.absolute == p2.absolute
    assert_dir_entry_equal(p1, p2)


def assert_dir_node_equal(dn1, dn2):
    assert_recursion_path_equal(dn1.path, dn2.path)
    if isinstance(dn1, LinkedDir):
        assert isinstance(dn2, LinkedDir)
    elif isinstance(dn1, CyclicLinkedDir):
        assert isinstance(dn2, CyclicLinkedDir)
        assert_recursion_path_equal(dn1.target_path, dn2.target_path)
    else:
        for path1, path2 in zip(dn1.files, dn2.files):
            assert_recursion_path_equal(path1, path2)
        for sub_dn1, sub_dn2 in zip(dn1.directories, dn2.directories):
            assert_dir_node_equal(sub_dn1, sub_dn2)


def get_mock_recursion_path(relative, root=None, is_dir=False, is_symlink=False):
    dir_entry = DirEntryReplacement(
        path=relative,
        name=os.path.basename(relative)
    )
    dir_entry._is_dir = is_dir
    dir_entry._is_file = not is_dir
    dir_entry._is_symlink = is_symlink
    return RecursionPath(
        root=root,
        relative=relative,
        real=None,
        dir_entry=dir_entry
    )
scantree-0.0.1/src/scantree.egg-info/0000755000076500000240000000000013447627322020715 5ustar andershussstaff00000000000000
scantree-0.0.1/src/scantree.egg-info/PKG-INFO0000644000076500000240000001240713447627322022016 0ustar andershussstaff00000000000000
Metadata-Version: 2.1
Name: scantree
Version: 0.0.1
Summary: Flexible recursive directory iterator: scandir meets glob("**", recursive=True)
Home-page: https://github.com/andhus/scantree
Author: Anders Huss
Author-email: andhus@kth.se
License: MIT
Description: [![Build Status](https://travis-ci.com/andhus/scantree.svg?branch=master)](https://travis-ci.com/andhus/scantree)
        [![codecov](https://codecov.io/gh/andhus/scantree/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/scantree)

        # scantree
        Recursive directory iterator supporting:
        - flexible filtering including wildcard path matching
        - in memory representation of file-tree (for repeated access)
        - efficient access to directory entry properties (`posix.DirEntry` interface) extended with real path and path relative to the recursion root directory
        - detection and handling of cyclic symlinks

        ## Installation
        ```commandline
        pip install scantree
        ```

        ## Usage
        See source code for full documentation, some generic examples below.

        Get matching file paths:
        ```python
        from scantree import scantree, RecursionFilter

        tree = scantree('/path/to/dir', RecursionFilter(match=['*.txt']))
        print([path.relative for path in tree.filepaths()])
        print([path.real for path in tree.filepaths()])
        ```
        ```
        ['d1/d2/file3.txt', 'd1/file2.txt', 'file1.txt']
        ['/path/to/other_dir/file3.txt', '/path/to/dir/d1/file2.txt', '/path/to/dir/file1.txt']
        ```

        Access metadata of directory entries in file tree:
        ```python
        d2 = tree.directories[0].directories[0]
        print(type(d2))
        print(d2.path.absolute)
        print(d2.path.real)
        print(d2.path.is_symlink())
        print(d2.files[0].relative)
        ```
        ```
        scantree._node.DirNode
        /path/to/dir/d1/d2
        /path/to/other_dir
        True
        d1/d2/file3.txt
        ```

        Aggregate information by operating on tree:
        ```python
        hello_count = tree.apply(
            file_apply=lambda path: sum([
                w.lower() == 'hello' for w in
                path.as_pathlib().read_text().split()
            ]),
            dir_apply=lambda dir_: sum(dir_.entries),
        )
        print(hello_count)
        ```
        ```
        3
        ```

        ```python
        hello_count_tree = tree.apply(
            file_apply=lambda path: {
                'name': path.name,
                'count': sum([
                    w.lower() == 'hello'
                    for w in path.as_pathlib().read_text().split()
                ])
            },
            dir_apply=lambda dir_: {
                'name': dir_.path.name,
                'count': sum(e['count'] for e in dir_.entries),
                'sub_counts': [e for e in dir_.entries]
            },
        )
        from pprint import pprint
        pprint(hello_count_tree)
        ```
        ```
        {'count': 3,
         'name': 'dir',
         'sub_counts': [{'count': 2, 'name': 'file1.txt'},
                        {'count': 1,
                         'name': 'd1',
                         'sub_counts': [{'count': 1, 'name': 'file2.txt'},
                                        {'count': 0,
                                         'name': 'd2',
                                         'sub_counts': [{'count': 0,
                                                         'name': 'file3.txt'}]}]}]}
        ```

        Flexible filtering:
        ```python
        without_hidden_files = scantree('.', RecursionFilter(match=['*', '!.*']))

        without_palindrome_linked_dirs = scantree(
            '.',
            lambda paths: [
                p for p in paths if not (
                    p.is_dir() and
                    p.is_symlink() and
                    p.name == p.name[::-1]
                )
            ]
        )
        ```

        Comparison:
        ```python
        tree = scantree('path/to/dir')
        # make some operations on filesystem, make sure file tree is the same:
        assert tree == scantree('path/to/dir')

        # tree contains absolute/real path info:
        import shutil
        shutil.copytree('path/to/dir', 'path/to/other_dir')
        new_tree = scantree('path/to/other_dir')
        assert tree != new_tree
        assert (
            [p.relative for p in tree.leafpaths()] ==
            [p.relative for p in new_tree.leafpaths()]
        )
        ```

        Inspect symlinks:
        ```python
        from scantree import CyclicLinkedDir

        file_links = []
        dir_links = []
        cyclic_links = []

        def file_apply(path):
            if path.is_symlink():
                file_links.append(path)

        def dir_apply(dir_node):
            if dir_node.path.is_symlink():
                dir_links.append(dir_node.path)
            if isinstance(dir_node, CyclicLinkedDir):
                cyclic_links.append((dir_node.path, dir_node.target_path))

        scantree('.', file_apply=file_apply, dir_apply=dir_apply)
        ```

Platform: UNKNOWN
Description-Content-Type: text/markdown
scantree-0.0.1/src/scantree.egg-info/SOURCES.txt0000644000076500000240000000057613447627322022611 0ustar andershussstaff00000000000000
LICENSE
MANIFEST.in
README.md
setup.py
src/scantree/__init__.py
src/scantree/_filter.py
src/scantree/_node.py
src/scantree/_path.py
src/scantree/_scan.py
src/scantree/compat.py
src/scantree/test_utils.py
src/scantree.egg-info/PKG-INFO
src/scantree.egg-info/SOURCES.txt
src/scantree.egg-info/dependency_links.txt
src/scantree.egg-info/requires.txt
src/scantree.egg-info/top_level.txt
scantree-0.0.1/src/scantree.egg-info/dependency_links.txt0000644000076500000240000000000113447627322024763 0ustar andershussstaff00000000000000

scantree-0.0.1/src/scantree.egg-info/requires.txt0000644000076500000240000000017713447627322023322 0ustar andershussstaff00000000000000
attrs>=18.0.0
six>=1.12.0
pathspec>=0.5.9

[:python_version < "3.4"]
pathlib2>=2.3.3

[:python_version < "3.5"]
scandir>=1.9.0
scantree-0.0.1/src/scantree.egg-info/top_level.txt0000644000076500000240000000001113447627322023437 0ustar andershussstaff00000000000000
scantree