dirhash-0.2.1/0000755000076500000240000000000013721536627014452 5ustar andershussstaff00000000000000dirhash-0.2.1/LICENSE0000644000076500000240000000205413647556153015463 0ustar andershussstaff00000000000000MIT License Copyright (c) 2019 Anders Huss Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. dirhash-0.2.1/MANIFEST.in0000644000076500000240000000003213647556153016206 0ustar andershussstaff00000000000000include README.md LICENSE dirhash-0.2.1/PKG-INFO0000644000076500000240000001311513721536627015550 0ustar andershussstaff00000000000000Metadata-Version: 2.1 Name: dirhash Version: 0.2.1 Summary: Python module and CLI for hashing of file system directories. Home-page: https://github.com/andhus/dirhash-python Author: Anders Huss Author-email: andhus@kth.se License: MIT Description: [![Build Status](https://travis-ci.com/andhus/dirhash-python.svg?branch=master)](https://travis-ci.com/andhus/dirhash-python) [![codecov](https://codecov.io/gh/andhus/dirhash-python/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash-python) # dirhash A lightweight python module and CLI for computing the hash of any directory based on its files' structure and content. - Supports all hashing algorithms of Python's built-in `hashlib` module. - Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude. - Multiprocessing for up to [6x speed-up](#performance) The hash is computed according to the [Dirhash Standard](https://github.com/andhus/dirhash), which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations. ## Installation From PyPI: ```commandline pip install dirhash ``` Or directly from source: ```commandline git clone git@github.com:andhus/dirhash-python.git pip install dirhash/ ``` ## Usage Python module: ```python from dirhash import dirhash dirpath = "path/to/directory" dir_md5 = dirhash(dirpath, "md5") pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"]) no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"]) ``` CLI: ```commandline dirhash path/to/directory -a md5 dirhash path/to/directory -a md5 --match "*.py" dirhash path/to/directory -a sha1 --ignore ".*" ".*/" ``` ## Why? If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful. Use-cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression-testing. 
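For the dataset use-case, a minimal sketch of such a check (the `data/train` path and the expected digest below are hypothetical placeholders):

```python
from dirhash import dirhash

# Hypothetical known-good checksum, recorded when the dataset was frozen.
EXPECTED_MD5 = "a3f5e8d0c1b2a4f6e8d0c1b2a4f6e8d0"

# Hash the directory in 4 parallel processes and compare against the record.
actual_md5 = dirhash("data/train", "md5", jobs=4)
if actual_md5 != EXPECTED_MD5:
    raise RuntimeError("dataset content changed - aborting training")
```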
There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO-questions for [linux](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents) and [python](https://stackoverflow.com/questions/24937495/how-can-i-calculate-a-hash-for-a-filesystem-directory-using-python)) but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. `dirhash` was created with this as the goal.

[checksumdir](https://github.com/cakepietoast/checksumdir) is another python module/tool with similar intent (that inspired this project) but it lacks much of the functionality offered here (most notably including file names/structure in the hash) and lacks tests.

## Performance

The python `hashlib` implementations of common hashing algorithms are highly optimised. `dirhash` mainly parses the file tree, pipes data to `hashlib` and combines the output. Reasonable measures have been taken to minimize the overhead and, for common use-cases, the majority of time is spent reading data from disk and executing `hashlib` code.

The main effort to boost performance is support for multiprocessing, where the reading and hashing is parallelized over individual files.

As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/cli.py) with the shell command:

`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`

which is the top answer for the SO-question:
[Linux: compute a single hash for a given folder & contents?](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)

Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.

Implementation      | Test Case       | Time (s) | Speed up
------------------- | --------------- | -------: | -------:
shell reference     | flat_1k_1MB     |     2.29 |   -> 1.0
`dirhash`           | flat_1k_1MB     |     1.67 |     1.36
`dirhash`(8 workers)| flat_1k_1MB     |     0.48 | **4.73**
shell reference     | nested_32k_32kB |     6.82 |   -> 1.0
`dirhash`           | nested_32k_32kB |     3.43 |     2.00
`dirhash`(8 workers)| nested_32k_32kB |     1.14 | **6.00**

The benchmark was run on a MacBook Pro (2018); further details and source code [here](https://github.com/andhus/dirhash-python/tree/master/benchmark).

## Documentation

Please refer to `dirhash -h`, the python [source code](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/__init__.py) and the [Dirhash Standard](https://github.com/andhus/dirhash).

Platform: UNKNOWN
Description-Content-Type: text/markdown
dirhash-0.2.1/README.md0000644000076500000240000001105113651271750015722 0ustar andershussstaff00000000000000[![Build Status](https://travis-ci.com/andhus/dirhash-python.svg?branch=master)](https://travis-ci.com/andhus/dirhash-python)
[![codecov](https://codecov.io/gh/andhus/dirhash-python/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash-python)

# dirhash

A lightweight python module and CLI for computing the hash of any directory based on its files' structure and content.

- Supports all hashing algorithms of Python's built-in `hashlib` module.
- Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
- Multiprocessing for up to [6x speed-up](#performance)

The hash is computed according to the [Dirhash Standard](https://github.com/andhus/dirhash), which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations.

## Installation

From PyPI:

```commandline
pip install dirhash
```

Or directly from source:

```commandline
git clone git@github.com:andhus/dirhash-python.git
pip install dirhash/
```

## Usage

Python module:

```python
from dirhash import dirhash

dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
```

CLI:

```commandline
dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
```

## Why?

If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful. Use-cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression-testing.

There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO-questions for [linux](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents) and [python](https://stackoverflow.com/questions/24937495/how-can-i-calculate-a-hash-for-a-filesystem-directory-using-python)) but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. `dirhash` was created with this as the goal.

[checksumdir](https://github.com/cakepietoast/checksumdir) is another python module/tool with similar intent (that inspired this project) but it lacks much of the functionality offered here (most notably including file names/structure in the hash) and lacks tests.

## Performance

The python `hashlib` implementations of common hashing algorithms are highly optimised. `dirhash` mainly parses the file tree, pipes data to `hashlib` and combines the output. Reasonable measures have been taken to minimize the overhead and, for common use-cases, the majority of time is spent reading data from disk and executing `hashlib` code.

The main effort to boost performance is support for multiprocessing, where the reading and hashing is parallelized over individual files.

As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/cli.py) with the shell command:

`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`

which is the top answer for the SO-question:
[Linux: compute a single hash for a given folder & contents?](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)

Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.
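For concreteness, the two invocations compared were along the lines of the following (the directory path is a placeholder; `-j 8` runs `dirhash` with 8 parallel workers):

```commandline
find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5
dirhash path/to/folder -a md5 -j 8
```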
Implementation      | Test Case       | Time (s) | Speed up
------------------- | --------------- | -------: | -------:
shell reference     | flat_1k_1MB     |     2.29 |   -> 1.0
`dirhash`           | flat_1k_1MB     |     1.67 |     1.36
`dirhash`(8 workers)| flat_1k_1MB     |     0.48 | **4.73**
shell reference     | nested_32k_32kB |     6.82 |   -> 1.0
`dirhash`           | nested_32k_32kB |     3.43 |     2.00
`dirhash`(8 workers)| nested_32k_32kB |     1.14 | **6.00**

The benchmark was run on a MacBook Pro (2018); further details and source code [here](https://github.com/andhus/dirhash-python/tree/master/benchmark).

## Documentation

Please refer to `dirhash -h`, the python [source code](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/__init__.py) and the [Dirhash Standard](https://github.com/andhus/dirhash).
dirhash-0.2.1/setup.cfg0000644000076500000240000000004613721536627016273 0ustar andershussstaff00000000000000[egg_info]
tag_build = 
tag_date = 0

dirhash-0.2.1/setup.py0000644000076500000240000000207613721536227016165 0ustar andershussstaff00000000000000import io
import os

from setuptools import setup, find_packages

PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))

version = {}
with io.open(os.path.join(PROJECT_ROOT, "src", "dirhash", "version.py")) as fp:
    exec(fp.read(), version)

DESCRIPTION = 'Python module and CLI for hashing of file system directories.'

try:
    with io.open(os.path.join(PROJECT_ROOT, 'README.md'), encoding='utf-8') as f:
        long_description = '\n' + f.read()
except IOError:
    long_description = DESCRIPTION

setup(
    name='dirhash',
    version=version['__version__'],
    description=DESCRIPTION,
    long_description=long_description,
    long_description_content_type="text/markdown",
    url='https://github.com/andhus/dirhash-python',
    author="Anders Huss",
    author_email="andhus@kth.se",
    license='MIT',
    install_requires=['scantree>=0.0.1'],
    packages=find_packages('src'),
    package_dir={'': 'src'},
    include_package_data=True,
    entry_points={
        'console_scripts': ['dirhash=dirhash.cli:main'],
    },
    tests_require=['pytest', 'pytest-cov']
)
dirhash-0.2.1/src/0000755000076500000240000000000013721536627015241 5ustar andershussstaff00000000000000dirhash-0.2.1/src/dirhash/0000755000076500000240000000000013721536627016663 5ustar andershussstaff00000000000000dirhash-0.2.1/src/dirhash/__init__.py0000644000076500000240000006202713721536227020777 0ustar andershussstaff00000000000000#!/usr/bin/env python
"""dirhash - a python library (and CLI) for hashing of file system directories.
"""
from __future__ import print_function, division

import os
import hashlib
import pkg_resources

from functools import partial
from multiprocessing import Pool

from scantree import (
    scantree,
    RecursionFilter,
    CyclicLinkedDir,
)

from dirhash.version import __version__

__all__ = [
    '__version__',
    'algorithms_guaranteed',
    'algorithms_available',
    'dirhash',
    'dirhash_impl',
    'included_paths',
    'Filter',
    'get_match_patterns',
    'Protocol'
]

algorithms_guaranteed = {'md5', 'sha1', 'sha224', 'sha256', 'sha384', 'sha512'}
algorithms_available = hashlib.algorithms_available


def dirhash(
    directory,
    algorithm,
    match=("*",),
    ignore=None,
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
    entry_properties=('name', 'data'),
    allow_cyclic_links=False,
    chunk_size=2**20,
    jobs=1
):
    """Computes the hash of a directory based on its structure and content.

    # Arguments
        directory: Union[str, pathlib.Path] - Path to the directory to hash.
        algorithm: str - The name of the hashing algorithm to use. See
            `dirhash.algorithms_available` for the available options.
        match: Iterable[str] - An iterable of glob/wildcard match-patterns for
            paths to include when computing the hash. Default is ["*"] which
            means that all files and directories are matched. To e.g. only
            include python source files, use: `match=["*.py"]`. See "Path
            Selection and Filtering" section below for further details.
        ignore: Optional[Iterable[str]] - An iterable of glob/wildcard
            match-patterns for paths to ignore when computing the hash. Default
            `None` (no ignore patterns). To e.g. exclude hidden files and
            directories use: `ignore=[".*/", ".*"]`. See "Path Selection and
            Filtering" section below for further details.
        linked_dirs: bool - If `True` (default), follow symbolic links to other
            *directories* and include these and their content in the hash
            computation.
        linked_files: bool - If `True` (default), include symbolic linked files
            in the hash computation.
        empty_dirs: bool - If `True`, include empty directories when computing
            the hash. A directory is considered empty if it does not contain any
            files that *match the provided matching criteria*. Default `False`,
            i.e. empty directories are ignored (as is done in git version
            control).
        entry_properties: Iterable[str] - A set (i.e. order does not matter) of
            the file/directory properties to consider when computing the hash.
            Supported properties are {"name", "data", "is_link"} where at least
            one of "name" and "data" must be included. Default is
            ["name", "data"] which means that the content (actual data) of
            files, as well as their path relative to the root `directory`, will
            affect the hash value. See "Entry Properties Interpretation" section
            below for further details.
        allow_cyclic_links: bool - If `False` (default) a
            `SymlinkRecursionError` is raised on presence of cyclic symbolic
            links. If set to `True`, the dirhash value for the directory causing
            the cyclic link is replaced with the hash function hexdigest of the
            relative path from the link to the target.
        chunk_size: int - The number of bytes to read in one go from files while
            being hashed. Too small a size will slow down the processing and a
            larger size consumes more working memory. Default 2**20 bytes = 1 MiB.
        jobs: int - The number of processes to use when computing the hash.
            Default `1`, which means that a single (the main) process is used.
            NOTE that using multiprocessing can significantly speed up execution,
            see `https://github.com/andhus/dirhash-python/benchmark` for further
            details.

    # Returns
        str - The hash/checksum as a string of the hexadecimal digits (the
            result of the `hexdigest` method of the hashlib._hashlib.HASH object
            corresponding to the provided `algorithm`).

    # Raises
        TypeError/ValueError: For incorrectly provided arguments.
        SymlinkRecursionError: In case the `directory` contains symbolic links
            that lead to (infinite) recursion and `allow_cyclic_links=False`
            (default).

    # Path Selection and Filtering
        Provided glob/wildcard (".gitignore style") match-patterns determine
        what paths within the `directory` to include when computing the hash
        value. Paths *relative to the root `directory`* (i.e. excluding the name
        of the root directory itself) are matched against the patterns.
            The `match` argument represents what should be *included* - as
        opposed to the `ignore` argument for which matches are *excluded*.
        Using `ignore` is just short for adding the same patterns to the `match`
        argument with the prefix "!", i.e. the calls below are equivalent:
            `dirhash(..., match=["*", "!<pattern>"])`
            `dirhash(..., ignore=["<pattern>"])`
        To validate which paths are included, call `dirhash.included_paths` with
        the same values for the arguments: `match`, `ignore`, `linked_dirs`,
        `linked_files` and `empty_dirs` to get a list of all paths that will be
        included when computing the hash by this function.

    # Entry Properties Interpretation
        - ["name", "data"] (Default) - The name as well as data is included. Due
            to the recursive nature of the dirhash computation, "name" implies
            that the path relative to the root `directory` of each
            file/directory affects the computed hash value.
        - ["data"] - Compute the hash only based on the data of files - *not*
            their names or the names of their parent directories. NOTE that the
            tree structure in which files are organized under the `directory`
            root still influences the computed hash. As long as all files have
            the same content and are organised the same way in relation to all
            other files in the Directed Acyclic Graph representing the
            file-tree, the hash will remain the same (but the "name of nodes"
            does not matter). This option can e.g. be used to verify that data
            is unchanged after renaming files (change extensions etc.).
        - ["name"] - Compute the hash only based on the name and location of
            files in the file tree under the `directory` root. This option can
            e.g. be used to check if any files have been added/moved/removed,
            ignoring the content of each file.
        - "is_link" - if this option is added to any of the cases above, the
            hash value is also affected by whether a file or directory is a
            symbolic link or not. NOTE: with this property added, the hash will
            be different than without it even if there are no symbolic links in
            the directory.

    # References
        See https://github.com/andhus/dirhash/README.md for a formal
        description of how the returned hash value is computed.
    """
    filter_ = Filter(
        match_patterns=get_match_patterns(match=match, ignore=ignore),
        linked_dirs=linked_dirs,
        linked_files=linked_files,
        empty_dirs=empty_dirs
    )
    protocol = Protocol(
        entry_properties=entry_properties,
        allow_cyclic_links=allow_cyclic_links
    )

    return dirhash_impl(
        directory=directory,
        algorithm=algorithm,
        filter_=filter_,
        protocol=protocol,
        chunk_size=chunk_size,
        jobs=jobs
    )


def dirhash_impl(
    directory,
    algorithm,
    filter_=None,
    protocol=None,
    chunk_size=2**20,
    jobs=1
):
    """Computes the hash of a directory based on its structure and content.

    In contrast to `dirhash.dirhash`, this function accepts custom
    implementations of the `dirhash.Filter` and `dirhash.Protocol` classes.

    # Arguments
        directory: Union[str, pathlib.Path] - Path to the directory to hash.
        algorithm: str - The name of the hashing algorithm to use. See
            `dirhash.algorithms_available` for the available options. It is also
            possible to provide a callable object that returns an instance
            implementing the `hashlib._hashlib.HASH` interface.
        filter_: dirhash.Filter - Determines what files and directories to
            include when computing the hash. See docs of `dirhash.Filter` for
            further details.
        protocol: dirhash.Protocol - Determines (mainly) what properties of
            files and directories to consider when computing the hash value.
        chunk_size: int - The number of bytes to read in one go from files while
            being hashed. Too small a size will slow down the processing and a
            larger size consumes more working memory. Default 2**20 bytes = 1 MiB.
        jobs: int - The number of processes to use when computing the hash.
            Default `1`, which means that a single (the main) process is used.
            NOTE that using multiprocessing can significantly speed up
            execution, see `https://github.com/andhus/dirhash/tree/master/benchmark`
            for further details.

    # Returns
        str - The hash/checksum as a string of the hexadecimal digits (the
            result of the `hexdigest` method of the hashlib._hashlib.HASH object
            corresponding to the provided `algorithm`).

    # Raises
        TypeError/ValueError: For incorrectly provided arguments.
        SymlinkRecursionError: In case the `directory` contains symbolic links
            that lead to (infinite) recursion and the protocol option
            `allow_cyclic_links` is `False`.

    # References
        See https://github.com/andhus/dirhash/README.md for a formal
        description of how the returned hash value is computed.
    """
    def get_instance(value, cls_, argname):
        if isinstance(value, cls_):
            return value
        if value is None:
            return cls_()
        raise TypeError('{} must be an instance of {} or None'.format(argname, cls_))

    filter_ = get_instance(filter_, Filter, 'filter_')
    protocol = get_instance(protocol, Protocol, 'protocol')
    hasher_factory = _get_hasher_factory(algorithm)

    def dir_apply(dir_node):
        if not filter_.empty_dirs:
            if dir_node.path.relative == '' and dir_node.empty:
                # only check if root node is empty (other empty dirs are
                # filtered out before `dir_apply` when `filter_.empty_dirs=False`)
                raise ValueError('{}: Nothing to hash'.format(directory))
        descriptor = protocol.get_descriptor(dir_node)
        _dirhash = hasher_factory(descriptor.encode('utf-8')).hexdigest()

        return dir_node.path, _dirhash

    if jobs == 1:
        cache = {}

        def file_apply(path):
            return path, _get_filehash(
                path.real,
                hasher_factory,
                chunk_size=chunk_size,
                cache=cache
            )

        _, dirhash_ = scantree(
            directory,
            recursion_filter=filter_,
            file_apply=file_apply,
            dir_apply=dir_apply,
            follow_links=True,
            allow_cyclic_links=protocol.allow_cyclic_links,
            cache_file_apply=False,
            include_empty=filter_.empty_dirs,
            jobs=1
        )
    else:  # multiprocessing
        real_paths = set()

        def extract_real_paths(path):
            real_paths.add(path.real)
            return path

        root_node = scantree(
            directory,
            recursion_filter=filter_,
            file_apply=extract_real_paths,
            follow_links=True,
            allow_cyclic_links=protocol.allow_cyclic_links,
            cache_file_apply=False,
            include_empty=filter_.empty_dirs,
            jobs=1
        )
        real_paths = list(real_paths)
        # hash files in parallel
        file_hashes = _parmap(
            partial(
                _get_filehash,
                hasher_factory=hasher_factory,
                chunk_size=chunk_size
            ),
            real_paths,
            jobs=jobs
        )
        # prepare the mapping with precomputed file hashes
        real_path_to_hash = dict(zip(real_paths, file_hashes))

        def file_apply(path):
            return path, real_path_to_hash[path.real]

        _, dirhash_ = root_node.apply(file_apply=file_apply, dir_apply=dir_apply)

    return dirhash_


def included_paths(
    directory,
    match=("*",),
    ignore=None,
    linked_dirs=True,
    linked_files=True,
    empty_dirs=False,
    allow_cyclic_links=False,
):
    """Inspect what paths are included for the corresponding arguments to the
    `dirhash.dirhash` function.

    # Arguments:
        This function accepts the following subset of the function
        `dirhash.dirhash` arguments: `directory`, `match`, `ignore`,
        `linked_dirs`, `linked_files`, `empty_dirs` and `allow_cyclic_links`,
        *with the same interpretation*. See docs of `dirhash.dirhash` for
        further details.

    # Returns
        List[str] - A sorted list of the paths that would be included when
        computing the hash of the `directory` using `dirhash.dirhash` and the
        same arguments.
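    # Example
        For a hypothetical directory containing the files "a.py" and "b.txt"
        plus an empty subdirectory "sub", the default arguments yield
        `["a.py", "b.txt"]`; with `empty_dirs=True`, the marker path "sub/."
        is included as well.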
""" filter_ = Filter( match_patterns=get_match_patterns(match=match, ignore=ignore), linked_dirs=linked_dirs, linked_files=linked_files, empty_dirs=empty_dirs ) protocol = Protocol(allow_cyclic_links=allow_cyclic_links) leafpaths = scantree( directory, recursion_filter=filter_, follow_links=True, allow_cyclic_links=protocol.allow_cyclic_links, include_empty=filter_.empty_dirs ).leafpaths() return [ path.relative if path.is_file() else os.path.join(path.relative, '.') for path in leafpaths ] class Filter(RecursionFilter): """Specification of what files and directories to include for the `dirhash` computation. # Arguments match: Iterable[str] - An iterable of glob/wildcard (".gitignore style") match patterns for selection of which files and directories to include. Paths *relative to the root `directory`* (i.e. excluding the name of the root directory itself) are matched against the provided patterns. For example, to include all files, except for hidden ones use: `match=['*', '!.*']` Default `None` which is equivalent to `['*']`, i.e. everything is included. linked_dirs: bool - If `True` (default), follow symbolic links to other *directories* and include these and their content in the hash computation. linked_files: bool - If `True` (default), include symbolic linked files in the hash computation. empty_dirs: bool - If `True`, include empty directories when computing the hash. A directory is considered empty if it does not contain any files that *matches provided matching criteria*. Default `False`, i.e. empty directories are ignored (as is done in git version control). """ def __init__( self, match_patterns=None, linked_dirs=True, linked_files=True, empty_dirs=False ): super(Filter, self).__init__( linked_dirs=linked_dirs, linked_files=linked_files, match=match_patterns ) self.empty_dirs = empty_dirs def get_match_patterns( match=None, ignore=None, ignore_extensions=None, ignore_hidden=False, ): """Helper to compose a list of list of glob/wildcard (".gitignore style") match patterns based on options dedicated for a few standard use-cases. # Arguments match: Optional[List[str]] - A list of match-patterns for files to *include*. Default `None` which is equivalent to `['*']`, i.e. everything is included (unless excluded by arguments below). ignore: Optional[List[str]] - A list of match-patterns for files to *ignore*. Default `None` (no ignore patterns). ignore_extensions: Optional[List[str]] - A list of file extensions to ignore. Short for `ignore=['*.', ...]` Default `None` (no extensions ignored). ignore_hidden: bool - If `True` ignore hidden files and directories. Short for `ignore=['.*', '.*/']` Default `False`. """ match = ['*'] if match is None else list(match) ignore = [] if ignore is None else list(ignore) ignore_extensions = [] if ignore_extensions is None else list(ignore_extensions) if ignore_hidden: ignore.extend(['.*', '.*/']) for ext in ignore_extensions: if not ext.startswith('.'): ext = '.' + ext ext = '*' + ext ignore.append(ext) match_spec = match + ['!' + ign for ign in ignore] def deduplicate(items): items_set = set([]) dd_items = [] for item in items: if item not in items_set: dd_items.append(item) items_set.add(item) return dd_items return deduplicate(match_spec) class Protocol(object): """Specifications of which file and directory properties to consider when computing the `dirhash` value. # Arguments entry_properties: Iterable[str] - A combination of the supported properties {"name", "data", "is_link"} where at least one of "name" and "data" is included. 
            Interpretation:
            - ["name", "data"] (Default) - The name as well as data is
                included. Due to the recursive nature of the dirhash
                computation, "name" implies that the path relative to the root
                `directory` of each file/directory affects the computed hash
                value.
            - ["data"] - Compute the hash only based on the data of files -
                *not* their names or the names of their parent directories.
                NOTE that the tree structure in which files are organized under
                the `directory` root still influences the computed hash. As
                long as all files have the same content and are organised the
                same way in relation to all other files in the Directed Acyclic
                Graph representing the file-tree, the hash will remain the same
                (but the "name of nodes" does not matter). This option can e.g.
                be used to verify that data is unchanged after renaming files
                (change extensions etc.).
            - ["name"] - Compute the hash only based on the name and location
                of files in the file tree under the `directory` root. This
                option can e.g. be used to check if any files have been
                added/moved/removed, ignoring the content of each file.
            - "is_link" - if this option is added to any of the cases above,
                the hash value is also affected by whether a file or directory
                is a symbolic link or not. NOTE: with this property added, the
                hash will be different than without it even if there are no
                symbolic links in the directory.
        allow_cyclic_links: bool - If `False` (default) a
            `SymlinkRecursionError` is raised on presence of cyclic symbolic
            links. If set to `True`, the dirhash value for the directory causing
            the cyclic link is replaced with the hash function hexdigest of the
            relative path from the link to the target.
    """
    class EntryProperties(object):
        NAME = 'name'
        DATA = 'data'
        IS_LINK = 'is_link'
        options = {NAME, DATA, IS_LINK}
        _DIRHASH = 'dirhash'

    _entry_property_separator = '\000'
    _entry_descriptor_separator = '\000\000'

    def __init__(
        self,
        entry_properties=('name', 'data'),
        allow_cyclic_links=False
    ):
        entry_properties = set(entry_properties)
        if not entry_properties.issubset(self.EntryProperties.options):
            raise ValueError(
                'entry properties {} not supported'.format(
                    entry_properties - self.EntryProperties.options)
            )
        if not (
            self.EntryProperties.NAME in entry_properties or
            self.EntryProperties.DATA in entry_properties
        ):
            raise ValueError(
                'at least one of entry properties `name` and `data` must be used'
            )
        self.entry_properties = entry_properties
        self._include_name = self.EntryProperties.NAME in entry_properties
        self._include_data = self.EntryProperties.DATA in entry_properties
        self._include_is_link = self.EntryProperties.IS_LINK in entry_properties

        if not isinstance(allow_cyclic_links, bool):
            raise ValueError(
                'allow_cyclic_links must be a boolean, '
                'got {}'.format(allow_cyclic_links)
            )
        self.allow_cyclic_links = allow_cyclic_links

    def get_descriptor(self, dir_node):
        if isinstance(dir_node, CyclicLinkedDir):
            return self._get_cyclic_linked_dir_descriptor(dir_node)

        entries = dir_node.directories + dir_node.files
        entry_descriptors = [
            self._get_entry_descriptor(
                self._get_entry_properties(path, entry_hash)
            ) for path, entry_hash in entries
        ]

        return self._entry_descriptor_separator.join(sorted(entry_descriptors))

    @classmethod
    def _get_entry_descriptor(cls, entry_properties):
        entry_strings = [
            '{}:{}'.format(name, value)
            for name, value in entry_properties
        ]

        return cls._entry_property_separator.join(sorted(entry_strings))

    def _get_entry_properties(self, path, entry_hash):
        properties = []
        if path.is_dir():
            properties.append((self.EntryProperties._DIRHASH, entry_hash))
        elif self._include_data:  # path is file
            properties.append((self.EntryProperties.DATA, entry_hash))

        if self._include_name:
            properties.append((self.EntryProperties.NAME, path.name))
        if self._include_is_link:
            properties.append((self.EntryProperties.IS_LINK, path.is_symlink()))

        return properties

    def _get_cyclic_linked_dir_descriptor(self, dir_node):
        relpath = dir_node.path.relative
        target_relpath = dir_node.target_path.relative
        path_to_target = os.path.relpath(
            # the extra '.' is needed if link back to root, because
            # an empty path ('') is not supported by os.path.relpath
            os.path.join('.', target_relpath),
            os.path.join('.', relpath)
        )
        # TODO normalize posix!
        return path_to_target


def _get_hasher_factory(algorithm):
    """Returns a "factory" of hasher instances corresponding to the given
    algorithm name. Bypasses input argument `algorithm` if it is already a
    hasher factory (verified by attempting calls to required methods).
    """
    if algorithm in algorithms_guaranteed:
        return getattr(hashlib, algorithm)

    if algorithm in algorithms_available:
        return partial(hashlib.new, algorithm)

    try:  # bypass algorithm if already a hasher factory
        hasher = algorithm(b'')
        hasher.update(b'')
        hasher.hexdigest()
        return algorithm
    except Exception:
        pass

    raise ValueError(
        '`algorithm` must be one of: {}'.format(algorithms_available))


def _parmap(func, iterable, jobs=1):
    """Map with multiprocessing.Pool"""
    if jobs == 1:
        return [func(element) for element in iterable]

    pool = Pool(jobs)
    try:
        results = pool.map(func, iterable)
    finally:
        pool.close()

    return results


def _get_filehash(filepath, hasher_factory, chunk_size, cache=None):
    """Compute the hash of the given filepath.

    # Arguments
        filepath: str - Path to the file to hash.
        hasher_factory: (f: f() -> hashlib._hashlib.HASH): Callable that returns
            an instance of the `hashlib._hashlib.HASH` interface.
        chunk_size (int): The number of bytes to read in one go from files while
            being hashed.
        cache ({str: str} | None): A mapping from `filepath` to hash (return
            value of this function). If not None, a lookup will be attempted
            before hashing the file and the result will be added after
            completion.

    # Returns
        The hash/checksum as a string of the hexadecimal digits.

    # Side-effects
        The `cache` is updated if not None.
    """
    if cache is not None:
        filehash = cache.get(filepath, None)
        if filehash is None:
            filehash = _get_filehash(filepath, hasher_factory, chunk_size)
            cache[filepath] = filehash

        return filehash

    hasher = hasher_factory()
    with open(filepath, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            hasher.update(chunk)

    return hasher.hexdigest()
dirhash-0.2.1/src/dirhash/cli.py0000644000076500000240000001441513647556153020014 0ustar andershussstaff00000000000000#!/usr/bin/env python
"""Get hash for the content and/or structure of a directory.
"""
from __future__ import print_function

import sys
import argparse

import dirhash


def main():
    try:
        kwargs = get_kwargs(sys.argv[1:])
        if kwargs.pop('list'):
            # kwargs below have no effect when listing
            for k in ['algorithm', 'chunk_size', 'jobs', 'entry_properties']:
                kwargs.pop(k)
            for leafpath in dirhash.included_paths(**kwargs):
                print(leafpath)
        else:
            print(dirhash.dirhash(**kwargs))
    except Exception as e:  # pragma: no cover (not picked up by coverage)
        sys.stderr.write('dirhash: {}\n'.format(e))
        sys.exit(1)


def get_kwargs(args):
    parser = argparse.ArgumentParser(
        description='Determine the hash for a directory.'
    )
    parser.add_argument(
        '-v', '--version',
        action='version',
        version='dirhash {}'.format(dirhash.__version__)
    )
    parser.add_argument(
        'directory',
        help='Directory to hash.'
    )
    parser.add_argument(
        '-a', '--algorithm',
        choices=dirhash.algorithms_available,
        default='md5',
        help=(
            'Hashing algorithm to use, by default "md5". Always available: {}. '
            'Additionally available on current platform: {}. Note that the same '
            'algorithm may appear multiple times in this set under different '
            'names (thanks to OpenSSL) '
            '[https://docs.python.org/2/library/hashlib.html]'.format(
                sorted(dirhash.algorithms_guaranteed),
                sorted(dirhash.algorithms_available - dirhash.algorithms_guaranteed)
            )
        ),
        metavar=''
    )

    filter_options = parser.add_argument_group(
        title='Filtering options',
        description=(
            'Specify what files and directories to include. All files and '
            'directories (including symbolic links) are included by default. '
            'The --match/--ignore arguments allow for selection using '
            'glob/wildcard (".gitignore style") path matching. Paths relative '
            'to the root `directory` (i.e. excluding the name of the root '
            'directory itself) are matched against the provided patterns. For '
            'example, to only include python source files, use: '
            '`dirhash path/to/dir -m "*.py"` or to exclude hidden files and '
            'directories use: `dirhash path/to/dir -i ".*" ".*/"` which is '
            'short for `dirhash path/to/dir -m "*" "!.*" "!.*/"`. By adding '
            'the --list argument, all included paths, for the given filtering '
            'arguments, are returned instead of the hash value. For further '
            'details see https://github.com/andhus/dirhash/README.md#filtering'
        )
    )
    filter_options.add_argument(
        '-m', '--match',
        nargs='+',
        default=['*'],
        help=(
            'One or several patterns for paths to include. NOTE: patterns '
            'with an asterisk must be in quotes ("*") or the asterisk '
            'preceded by an escape character (\\*).'
        ),
        metavar=''
    )
    filter_options.add_argument(
        '-i', '--ignore',
        nargs='+',
        default=None,
        help=(
            'One or several patterns for paths to exclude. NOTE: patterns '
            'with an asterisk must be in quotes ("*") or the asterisk '
            'preceded by an escape character (\\*).'
        ),
        metavar=''
    )
    filter_options.add_argument(
        '--empty-dirs',
        action='store_true',
        default=False,
        help='Include empty directories (containing no files that meet the '
             'matching criteria and no non-empty sub directories).'
    )
    filter_options.add_argument(
        '--no-linked-dirs',
        dest='linked_dirs',
        action='store_false',
        help='Do not include symbolic links to other directories.'
    )
    filter_options.add_argument(
        '--no-linked-files',
        dest='linked_files',
        action='store_false',
        help='Do not include symbolic links to files.'
    )
    parser.set_defaults(linked_dirs=True, linked_files=True)

    protocol_options = parser.add_argument_group(
        title='Protocol options',
        description=(
            'Specify what properties of files and directories to include and '
            'whether to allow cyclic links. For further details see '
            'https://github.com/andhus/dirhash/DIRHASH_STANDARD.md#protocol'
        )
    )
    protocol_options.add_argument(
        '-p', '--properties',
        nargs='+',
        dest='entry_properties',
        default=['data', 'name'],
        help=(
            'List of file/directory properties to include in the hash. '
            'Available properties are: {} and at least one of name and data '
            'must be '
            'included. Default is [data name] which means that both the '
            'name/paths and content (actual data) of files and directories '
            'will be included.'
        ).format(list(dirhash.Protocol.EntryProperties.options)),
        metavar=''
    )
    protocol_options.add_argument(
        '-c', '--allow-cyclic-links',
        default=False,
        action='store_true',
        help=(
            'Allow presence of cyclic links (by hashing the relative path to '
            'the target directory).'
        )
    )

    implementation_options = parser.add_argument_group(
        title='Implementation options',
        description=''
    )
    implementation_options.add_argument(
        '-s', '--chunk-size',
        default=2**20,
        type=int,
        help='The chunk size (in bytes) for reading of files.'
    )
    implementation_options.add_argument(
        '-j', '--jobs',
        type=int,
        default=1,  # TODO make default number of cores?
        help='Number of jobs (parallel processes) to use.'
    )

    special_options = parser.add_argument_group(title='Special options')
    special_options.add_argument(
        '-l', '--list',
        action='store_true',
        default=False,
        help='List the file paths that will be taken into account, given the '
             'provided filtering options.'
    )

    return vars(parser.parse_args(args))


if __name__ == '__main__':  # pragma: no cover
    main()
dirhash-0.2.1/src/dirhash/version.py0000644000076500000240000000002613721536227020714 0ustar andershussstaff00000000000000__version__ = '0.2.1'
dirhash-0.2.1/src/dirhash.egg-info/0000755000076500000240000000000013721536627020355 5ustar andershussstaff00000000000000dirhash-0.2.1/src/dirhash.egg-info/PKG-INFO0000644000076500000240000001311513721536627021453 0ustar andershussstaff00000000000000Metadata-Version: 2.1
Name: dirhash
Version: 0.2.1
Summary: Python module and CLI for hashing of file system directories.
Home-page: https://github.com/andhus/dirhash-python
Author: Anders Huss
Author-email: andhus@kth.se
License: MIT
Description: [![Build Status](https://travis-ci.com/andhus/dirhash-python.svg?branch=master)](https://travis-ci.com/andhus/dirhash-python)
[![codecov](https://codecov.io/gh/andhus/dirhash-python/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash-python)

# dirhash

A lightweight python module and CLI for computing the hash of any directory based on its files' structure and content.

- Supports all hashing algorithms of Python's built-in `hashlib` module.
- Glob/wildcard (".gitignore style") path matching for expressive filtering of files to include/exclude.
- Multiprocessing for up to [6x speed-up](#performance)

The hash is computed according to the [Dirhash Standard](https://github.com/andhus/dirhash), which is designed to allow for consistent and collision resistant generation/verification of directory hashes across implementations.

## Installation

From PyPI:

```commandline
pip install dirhash
```

Or directly from source:

```commandline
git clone git@github.com:andhus/dirhash-python.git
pip install dirhash/
```

## Usage

Python module:

```python
from dirhash import dirhash

dirpath = "path/to/directory"
dir_md5 = dirhash(dirpath, "md5")
pyfiles_md5 = dirhash(dirpath, "md5", match=["*.py"])
no_hidden_sha1 = dirhash(dirpath, "sha1", ignore=[".*", ".*/"])
```

CLI:

```commandline
dirhash path/to/directory -a md5
dirhash path/to/directory -a md5 --match "*.py"
dirhash path/to/directory -a sha1 --ignore ".*" ".*/"
```

## Why?

If you (or your application) need to verify the integrity of a set of files as well as their name and location, you might find this useful.
Use-cases range from verification of your image classification dataset (before spending GPU-$$$ on training your fancy Deep Learning model) to validation of generated files in regression-testing.

There isn't really a standard way of doing this. There are plenty of recipes out there (see e.g. these SO-questions for [linux](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents) and [python](https://stackoverflow.com/questions/24937495/how-can-i-calculate-a-hash-for-a-filesystem-directory-using-python)) but I couldn't find one that is properly tested (there are some gotchas to cover!) and documented with a compelling user interface. `dirhash` was created with this as the goal.

[checksumdir](https://github.com/cakepietoast/checksumdir) is another python module/tool with similar intent (that inspired this project) but it lacks much of the functionality offered here (most notably including file names/structure in the hash) and lacks tests.

## Performance

The python `hashlib` implementations of common hashing algorithms are highly optimised. `dirhash` mainly parses the file tree, pipes data to `hashlib` and combines the output. Reasonable measures have been taken to minimize the overhead and, for common use-cases, the majority of time is spent reading data from disk and executing `hashlib` code.

The main effort to boost performance is support for multiprocessing, where the reading and hashing is parallelized over individual files.

As a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/cli.py) with the shell command:

`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5`

which is the top answer for the SO-question:
[Linux: compute a single hash for a given folder & contents?](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)

Results for two test cases are shown below. Both have 1 GiB of random data: in "flat_1k_1MB", split into 1k files (1 MiB each) in a flat structure, and in "nested_32k_32kB", into 32k files (32 KiB each) spread over the 256 leaf directories in a binary tree of depth 8.

Implementation      | Test Case       | Time (s) | Speed up
------------------- | --------------- | -------: | -------:
shell reference     | flat_1k_1MB     |     2.29 |   -> 1.0
`dirhash`           | flat_1k_1MB     |     1.67 |     1.36
`dirhash`(8 workers)| flat_1k_1MB     |     0.48 | **4.73**
shell reference     | nested_32k_32kB |     6.82 |   -> 1.0
`dirhash`           | nested_32k_32kB |     3.43 |     2.00
`dirhash`(8 workers)| nested_32k_32kB |     1.14 | **6.00**

The benchmark was run on a MacBook Pro (2018); further details and source code [here](https://github.com/andhus/dirhash-python/tree/master/benchmark).

## Documentation

Please refer to `dirhash -h`, the python [source code](https://github.com/andhus/dirhash-python/blob/master/src/dirhash/__init__.py) and the [Dirhash Standard](https://github.com/andhus/dirhash).
Platform: UNKNOWN Description-Content-Type: text/markdown dirhash-0.2.1/src/dirhash.egg-info/SOURCES.txt0000644000076500000240000000047413721536627022246 0ustar andershussstaff00000000000000LICENSE MANIFEST.in README.md setup.py src/dirhash/__init__.py src/dirhash/cli.py src/dirhash/version.py src/dirhash.egg-info/PKG-INFO src/dirhash.egg-info/SOURCES.txt src/dirhash.egg-info/dependency_links.txt src/dirhash.egg-info/entry_points.txt src/dirhash.egg-info/requires.txt src/dirhash.egg-info/top_level.txtdirhash-0.2.1/src/dirhash.egg-info/dependency_links.txt0000644000076500000240000000000113721536627024423 0ustar andershussstaff00000000000000 dirhash-0.2.1/src/dirhash.egg-info/entry_points.txt0000644000076500000240000000005613721536627023654 0ustar andershussstaff00000000000000[console_scripts] dirhash = dirhash.cli:main dirhash-0.2.1/src/dirhash.egg-info/requires.txt0000644000076500000240000000002013721536627022745 0ustar andershussstaff00000000000000scantree>=0.0.1 dirhash-0.2.1/src/dirhash.egg-info/top_level.txt0000644000076500000240000000001013721536627023076 0ustar andershussstaff00000000000000dirhash