pax_global_header00006660000000000000000000000064126471647470014534gustar00rootroot0000000000000052 comment=6f5906f87ef8894c1bc827e420d111db82333ff7 joblib-0.9.4/000077500000000000000000000000001264716474700130075ustar00rootroot00000000000000joblib-0.9.4/.gitignore000066400000000000000000000006761264716474700150100ustar00rootroot00000000000000*.py[oc] *.so # setup.py working directory build # setup.py dist directory dist # Editor temporary/working/backup files *$ .*.sw[nop] .sw[nop] *~ [#]*# .#* *.bak *.tmp *.tgz *.rej *.org .project *.diff .settings/ *.svn/ # Egg metadata *.egg-info # The shelf plugin uses this dir ./.shelf # Some IDEs add this directory .idea # Mac droppings .DS_Store doc/documentation.zip doc/generated doc/CHANGES.rst doc/README.rst # Coverage report .coverage joblib-0.9.4/.mailmap000066400000000000000000000007231264716474700144320ustar00rootroot00000000000000Gael Varoquaux Gael Varoquaux Gael Varoquaux Gael varoquaux Gael Varoquaux GaelVaroquaux Gael Varoquaux Gael VAROQUAUX Gael Varoquaux gvaroquaux joblib-0.9.4/.travis.yml000066400000000000000000000016651264716474700151300ustar00rootroot00000000000000# make it explicit that we favor the new container-based travis workers sudo: false language: python env: matrix: - PYTHON_VERSION="2.6" NUMPY_VERSION="1.6" - PYTHON_VERSION="2.7" NUMPY_VERSION="1.7" - PYTHON_VERSION="3.3" NUMPY_VERSION="1.8" - PYTHON_VERSION="3.4" NUMPY_VERSION="1.9" # NUMPY_VERSION not set means numpy is not installed - PYTHON_VERSION="3.4" - PYTHON_VERSION="3.5" NUMPY_VERSION="1.10" COVERAGE="true" install: - source continuous_integration/install.sh script: - make after_success: # Ignore coveralls failures as the coveralls server is not very reliable # but we don't want travis to report a failure in the github UI just # because the coverage report failed to be published. # coveralls need to be run from the git checkout # so we need to copy the coverage results from TEST_RUN_FOLDER - if [[ "$COVERAGE" == "true" ]]; then coveralls || echo "failed"; fi joblib-0.9.4/CHANGES.rst000066400000000000000000000332571264716474700146230ustar00rootroot00000000000000Latest changes =============== Release 0.9.4 ------------- Olivier Grisel FIX a race condition that could cause a joblib.Parallel to hang when collecting the result of a job that triggers an exception. https://github.com/joblib/joblib/pull/296 Olivier Grisel FIX a bug that caused joblib.Parallel to wrongly reuse previously memmapped arrays instead of creating new temporary files. https://github.com/joblib/joblib/pull/294 for more details. Loïc Estève FIX for raising non inheritable exceptions in a Parallel call. See https://github.com/joblib/joblib/issues/269 for more details. Alexandre Abadie FIX joblib.hash error with mixed types sets and dicts containing mixed types keys when using Python 3. see https://github.com/joblib/joblib/issues/254 Loïc Estève FIX joblib.dump/load for big numpy arrays with dtype=object. See https://github.com/joblib/joblib/issues/220 for more details. Loïc Estève FIX joblib.Parallel hanging when used with an exhausted iterator. See https://github.com/joblib/joblib/issues/292 for more details. Release 0.9.3 ------------- Olivier Grisel Revert back to the ``fork`` start method (instead of ``forkserver``) as the latter was found to cause crashes in interactive Python sessions. Release 0.9.2 ------------- Loïc Estève Joblib hashing now uses the default pickle protocol (2 for Python 2 and 3 for Python 3). 
This makes it very unlikely to get the same hash for a given object under Python 2 and Python 3. In particular, for Python 3 users, this means that the output of joblib.hash changes when switching from joblib 0.8.4 to 0.9.2 . We strive to ensure that the output of joblib.hash does not change needlessly in future versions of joblib but this is not officially guaranteed. Loïc Estève Joblib pickles generated with Python 2 can not be loaded with Python 3 and the same applies for joblib pickles generated with Python 3 and loaded with Python 2. During the beta period 0.9.0b2 to 0.9.0b4, we experimented with a joblib serialization that aimed to make pickles serialized with Python 3 loadable under Python 2. Unfortunately this serialization strategy proved to be too fragile as far as the long-term maintenance was concerned (For example see https://github.com/joblib/joblib/pull/243). That means that joblib pickles generated with joblib 0.9.0bN can not be loaded under joblib 0.9.2. Joblib beta testers, who are the only ones likely to be affected by this, are advised to delete their joblib cache when they upgrade from 0.9.0bN to 0.9.2. Arthur Mensch Fixed a bug with ``joblib.hash`` that used to return unstable values for strings and numpy.dtype instances depending on interning states. Olivier Grisel Make joblib use the 'forkserver' start method by default under Python 3.4+ to avoid causing crash with 3rd party libraries (such as Apple vecLib / Accelerate or the GCC OpenMP runtime) that use an internal thread pool that is not not reinitialized when a ``fork`` system call happens. Olivier Grisel New context manager based API (``with`` block) to re-use the same pool of workers across consecutive parallel calls. Vlad Niculae and Olivier Grisel Automated batching of fast tasks into longer running jobs to hide multiprocessing dispatching overhead when possible. Olivier Grisel FIX make it possible to call ``joblib.load(filename, mmap_mode='r')`` on pickled objects that include a mix of arrays of both memmory memmapable dtypes and object dtype. Release 0.8.4 ------------- 2014-11-20 Olivier Grisel OPTIM use the C-optimized pickler under Python 3 This makes it possible to efficiently process parallel jobs that deal with numerous Python objects such as large dictionaries. Release 0.8.3 ------------- 2014-08-19 Olivier Grisel FIX disable memmapping for object arrays 2014-08-07 Lars Buitinck MAINT NumPy 1.10-safe version comparisons 2014-07-11 Olivier Grisel FIX #146: Heisen test failure caused by thread-unsafe Python lists This fix uses a queue.Queue datastructure in the failing test. This datastructure is thread-safe thanks to an internal Lock. This Lock instance not picklable hence cause the picklability check of delayed to check fail. When using the threading backend, picklability is no longer required, hence this PRs give the user the ability to disable it on a case by case basis. Release 0.8.2 ------------- 2014-06-30 Olivier Grisel BUG: use mmap_mode='r' by default in Parallel and MemmapingPool The former default of mmap_mode='c' (copy-on-write) caused problematic use of the paging file under Windows. 2014-06-27 Olivier Grisel BUG: fix usage of the /dev/shm folder under Linux Release 0.8.1 ------------- 2014-05-29 Gael Varoquaux BUG: fix crash with high verbosity Release 0.8.0 ------------- 2014-05-14 Olivier Grisel Fix a bug in exception reporting under Python 3 2014-05-10 Olivier Grisel Fixed a potential segfault when passing non-contiguous memmap instances. 
2014-04-22 Gael Varoquaux ENH: Make memory robust to modification of source files while the interpreter is running. Should lead to less spurious cache flushes and recomputations. 2014-02-24 Philippe Gervais New ``Memory.call_and_shelve`` API to handle memoized results by reference instead of by value. Release 0.8.0a3 --------------- 2014-01-10 Olivier Grisel & Gael Varoquaux FIX #105: Race condition in task iterable consumption when pre_dispatch != 'all' that could cause crash with error messages "Pools seems closed" and "ValueError: generator already executing". 2014-01-12 Olivier Grisel FIX #72: joblib cannot persist "output_dir" keyword argument. Release 0.8.0a2 --------------- 2013-12-23 Olivier Grisel ENH: set default value of Parallel's max_nbytes to 100MB Motivation: avoid introducing disk latency on medium sized parallel workload where memory usage is not an issue. FIX: properly handle the JOBLIB_MULTIPROCESSING env variable FIX: timeout test failures under windows Release 0.8.0a -------------- 2013-12-19 Olivier Grisel FIX: support the new Python 3.4 multiprocessing API 2013-12-05 Olivier Grisel ENH: make Memory respect mmap_mode at first call too ENH: add a threading based backend to Parallel This is low overhead alternative backend to the default multiprocessing backend that is suitable when calling compiled extensions that release the GIL. Author: Dan Stahlke Date: 2013-11-08 FIX: use safe_repr to print arg vals in trace This fixes a problem in which extremely long (and slow) stack traces would be produced when function parameters are large numpy arrays. 2013-09-10 Olivier Grisel ENH: limit memory copy with Parallel by leveraging numpy.memmap when possible Release 0.7.1 --------------- 2013-07-25 Gael Varoquaux MISC: capture meaningless argument (n_jobs=0) in Parallel 2013-07-09 Lars Buitinck ENH Handles tuples, sets and Python 3's dict_keys type the same as lists. in pre_dispatch 2013-05-23 Martin Luessi ENH: fix function caching for IPython Release 0.7.0 --------------- **This release drops support for Python 2.5 in favor of support for Python 3.0** 2013-02-13 Gael Varoquaux BUG: fix nasty hash collisions 2012-11-19 Gael Varoquaux ENH: Parallel: Turn of pre-dispatch for already expanded lists Gael Varoquaux 2012-11-19 ENH: detect recursive sub-process spawning, as when people do not protect the __main__ in scripts under Windows, and raise a useful error. Gael Varoquaux 2012-11-16 ENH: Full python 3 support Release 0.6.5 --------------- 2012-09-15 Yannick Schwartz BUG: make sure that sets and dictionnaries give reproducible hashes 2012-07-18 Marek Rudnicki BUG: make sure that object-dtype numpy array hash correctly 2012-07-12 GaelVaroquaux BUG: Bad default n_jobs for Parallel Release 0.6.4 --------------- 2012-05-07 Vlad Niculae ENH: controlled randomness in tests and doctest fix 2012-02-21 GaelVaroquaux ENH: add verbosity in memory 2012-02-21 GaelVaroquaux BUG: non-reproducible hashing: order of kwargs The ordering of a dictionnary is random. As a result the function hashing was not reproducible. 
Pretty hard to test Release 0.6.3 --------------- 2012-02-14 GaelVaroquaux BUG: fix joblib Memory pickling 2012-02-11 GaelVaroquaux BUG: fix hasher with Python 3 2012-02-09 GaelVaroquaux API: filter_args: `*args, **kwargs -> args, kwargs` Release 0.6.2 --------------- 2012-02-06 Gael Varoquaux BUG: make sure Memory pickles even if cachedir=None Release 0.6.1 --------------- Bugfix release because of a merge error in release 0.6.0 Release 0.6.0 --------------- **Beta 3** 2012-01-11 Gael Varoquaux BUG: ensure compatibility with old numpy DOC: update installation instructions BUG: file semantic to work under Windows 2012-01-10 Yaroslav Halchenko BUG: a fix toward 2.5 compatibility **Beta 2** 2012-01-07 Gael Varoquaux ENH: hash: bugware to be able to hash objects defined interactively in IPython 2012-01-07 Gael Varoquaux ENH: Parallel: warn and not fail for nested loops ENH: Parallel: n_jobs=-2 now uses all CPUs but one 2012-01-01 Juan Manuel Caicedo Carvajal and Gael Varoquaux ENH: add verbosity levels in Parallel Release 0.5.7 --------------- 2011-12-28 Gael varoquaux API: zipped -> compress 2011-12-26 Gael varoquaux ENH: Add a zipped option to Memory API: Memory no longer accepts save_npy 2011-12-22 Kenneth C. Arnold and Gael varoquaux BUG: fix numpy_pickle for array subclasses 2011-12-21 Gael varoquaux ENH: add zip-based pickling 2011-12-19 Fabian Pedregosa Py3k: compatibility fixes. This makes run fine the tests test_disk and test_parallel Release 0.5.6 --------------- 2011-12-11 Lars Buitinck ENH: Replace os.path.exists before makedirs with exception check New disk.mkdirp will fail with other errnos than EEXIST. 2011-12-10 Bala Subrahmanyam Varanasi MISC: pep8 compliant Release 0.5.5 --------------- 2011-19-10 Fabian Pedregosa ENH: Make joblib installable under Python 3.X Release 0.5.4 --------------- 2011-09-29 Jon Olav Vik BUG: Make mangling path to filename work on Windows 2011-09-25 Olivier Grisel FIX: doctest heisenfailure on execution time 2011-08-24 Ralf Gommers STY: PEP8 cleanup. Release 0.5.3 --------------- 2011-06-25 Gael varoquaux API: All the usefull symbols in the __init__ Release 0.5.2 --------------- 2011-06-25 Gael varoquaux ENH: Add cpu_count 2011-06-06 Gael varoquaux ENH: Make sure memory hash in a reproducible way Release 0.5.1 --------------- 2011-04-12 Gael varoquaux TEST: Better testing of parallel and pre_dispatch Yaroslav Halchenko 2011-04-12 DOC: quick pass over docs -- trailing spaces/spelling Yaroslav Halchenko 2011-04-11 ENH: JOBLIB_MULTIPROCESSING env var to disable multiprocessing from the environment Alexandre Gramfort 2011-04-08 ENH : adding log message to know how long it takes to load from disk the cache Release 0.5.0 --------------- 2011-04-01 Gael varoquaux BUG: pickling MemoizeFunc does not store timestamp 2011-03-31 Nicolas Pinto TEST: expose hashing bug with cached method 2011-03-26...2011-03-27 Pietro Berkes BUG: fix error management in rm_subdirs BUG: fix for race condition during tests in mem.clear() Gael varoquaux 2011-03-22...2011-03-26 TEST: Improve test coverage and robustness Gael varoquaux 2011-03-19 BUG: hashing functions with only \*var \**kwargs Gael varoquaux 2011-02-01... 2011-03-22 BUG: Many fixes to capture interprocess race condition when mem.cache is used by several processes on the same cache. Fabian Pedregosa 2011-02-28 First work on Py3K compatibility Gael varoquaux 2011-02-27 ENH: pre_dispatch in parallel: lazy generation of jobs in parallel for to avoid drowning memory. 
GaelVaroquaux 2011-02-24 ENH: Add the option of overloading the arguments of the mother 'Memory' object in the cache method that is doing the decoration. Gael varoquaux 2010-11-21 ENH: Add a verbosity level for more verbosity Release 0.4.6 ---------------- Gael varoquaux 2010-11-15 ENH: Deal with interruption in parallel Gael varoquaux 2010-11-13 BUG: Exceptions raised by Parallel when n_job=1 are no longer captured. Gael varoquaux 2010-11-13 BUG: Capture wrong arguments properly (better error message) Release 0.4.5 ---------------- Pietro Berkes 2010-09-04 BUG: Fix Windows peculiarities with path separators and file names BUG: Fix more windows locking bugs Gael varoquaux 2010-09-03 ENH: Make sure that exceptions raised in Parallel also inherit from the original exception class ENH: Add a shadow set of exceptions Fabian Pedregosa 2010-09-01 ENH: Clean up the code for parallel. Thanks to Fabian Pedregosa for the patch. Release 0.4.4 ---------------- Gael varoquaux 2010-08-23 BUG: Fix Parallel on computers with only one CPU, for n_jobs=-1. Gael varoquaux 2010-08-02 BUG: Fix setup.py for extra setuptools args. Gael varoquaux 2010-07-29 MISC: Silence tests (and hopefuly Yaroslav :P) Release 0.4.3 ---------------- Gael Varoquaux 2010-07-22 BUG: Fix hashing for function with a side effect modifying their input argument. Thanks to Pietro Berkes for reporting the bug and proving the patch. Release 0.4.2 ---------------- Gael Varoquaux 2010-07-16 BUG: Make sure that joblib still works with Python2.5. => release 0.4.2 Release 0.4.1 ---------------- joblib-0.9.4/MANIFEST.in000066400000000000000000000002001264716474700145350ustar00rootroot00000000000000include *.txt *.py recursive-include joblib *.rst *.py graft doc graft doc/_static graft doc/_templates global-exclude *~ *.swp joblib-0.9.4/Makefile000066400000000000000000000001501264716474700144430ustar00rootroot00000000000000 all: test test: nosetests test-no-multiprocessing: export JOBLIB_MULTIPROCESSING=0 && nosetests joblib-0.9.4/README.rst000066400000000000000000000114601264716474700145000ustar00rootroot00000000000000The homepage of joblib with user documentation is located on: https://pythonhosted.org/joblib/ Getting the latest code ========================= To get the latest code using git, simply type:: git clone git://github.com/joblib/joblib.git If you don't have git installed, you can download a zip or tarball of the latest code: http://github.com/joblib/joblib/archives/master Installing ========================= As any Python packages, to install joblib, simply do:: python setup.py install in the source code directory. Joblib has no other mandatory dependency than Python (supported versions are 2.6+ and 3.3+). Numpy (at least version 1.6.1) is an optional dependency for array manipulation. Workflow to contribute ========================= To contribute to joblib, first create an account on `github `_. Once this is done, fork the `joblib repository `_ to have you own repository, clone it using 'git clone' on the computers where you want to work. Make your changes in your clone, push them to your github account, test them on several computers, and when you are happy with them, send a pull request to the main repository. Running the test suite ========================= To run the test suite, you need the nose and coverage modules. Run the test suite using:: nosetests from the root of the project. |Travis| |AppVeyor| |Coveralls| .. 
|Travis| image:: https://travis-ci.org/joblib/joblib.svg?branch=master :target: https://travis-ci.org/joblib/joblib :alt: Travis build status .. |AppVeyor| image:: https://ci.appveyor.com/api/projects/status/github/joblib/joblib?branch=master&svg=true :target: https://ci.appveyor.com/project/joblib-ci/joblib/history :alt: AppVeyor build status .. |Coveralls| image:: https://coveralls.io/repos/joblib/joblib/badge.svg?branch=master&service=github :target: https://coveralls.io/github/joblib/joblib?branch=master :alt: Coveralls coverage Building the docs ========================= To build the docs you need to have setuptools and sphinx (>=0.5) installed. Run the command:: python setup.py build_sphinx The docs are built in the build/sphinx/html directory. Making a source tarball ========================= To create a source tarball, eg for packaging or distributing, run the following command:: python setup.py sdist The tarball will be created in the `dist` directory. This command will compile the docs, and the resulting tarball can be installed with no extra dependencies than the Python standard library. You will need setuptool and sphinx. Making a release and uploading it to PyPI ================================================== This command is only run by project manager, to make a release, and upload in to PyPI:: python setup.py sdist bdist_egg bdist_wheel register upload Updating the changelog ======================== Changes are listed in the CHANGES.rst file. They must be manually updated but, the following git command may be used to generate the lines:: git log --abbrev-commit --date=short --no-merges --sparse Licensing ---------- joblib is **BSD-licenced** (3 clause): This software is OSI Certified Open Source Software. OSI Certified is a certification mark of the Open Source Initiative. Copyright (c) 2009-2011, joblib developpers All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Gael Varoquaux. nor the names of other joblib contributors may be used to endorse or promote products derived from this software without specific prior written permission. **This software is provided by the copyright holders and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright owner or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.** joblib-0.9.4/TODO.rst000066400000000000000000000032601264716474700143070ustar00rootroot00000000000000Tasks at hand on joblib, in increasing order of difficulty. * Add a changelog! 
* In parallel: need to deal with return arguments that don't pickle. * Improve test coverage and documentation * Store a repr of the arguments for each call in the corresponding cachedir * Try to use Mike McKerns's Dill pickling module in Parallel: Implementation idea: * Create a new function that is wrapped and takes Dillo pickles as inputs as output, feed this one to multiprocessing * pickle everything using Dill in the Parallel object. http://dev.danse.us/trac/pathos/browser/dill * Make a sensible error message when wrong keyword arguments are given, currently we have:: from joblib import Memory mem = Memory(cachedir='cache') def f(a=0, b=2): return a, b g = mem.cache(f) g(c=2) /home/varoquau/dev/joblib/joblib/func_inspect.pyc in filter_args(func, ignore_lst, *args, **kwargs), line 168 TypeError: Ignore list for diffusion_reorder() contains and unexpected keyword argument 'cachedir' * add a 'depends' keyword argument to memory.cache, to be able to specify that a function depends on other functions, and thus that the cache should be cleared. * add a 'argument_hash' keyword argument to Memory.cache, to be able to replace the hashing logic of memory for the input arguments. It should accept as an input the dictionnary of arguments, as returned in func_inspect, and return a string. * add a sqlite db for provenance tracking. Store computation time and usage timestamps, to be able to do 'garbage-collection-like' cleaning of unused results, based on a cost function balancing computation cost and frequency of use. joblib-0.9.4/appveyor.yml000066400000000000000000000026561264716474700154100ustar00rootroot00000000000000environment: # There is no need to run the build for all the Python version / # architectures combo as the generated joblib wheel is the same on all # platforms (universal wheel). # We run the tests on 2 different target platforms for testing purpose only. matrix: - PYTHON: "C:\\Python27" PYTHON_VERSION: "2.7.x" PYTHON_ARCH: "32" - PYTHON: "C:\\Python34-x64" PYTHON_VERSION: "3.4.x" PYTHON_ARCH: "64" install: # Install Python (from the official .msi of http://python.org) and pip when # not already installed. - "powershell ./appveyor/install.ps1" - "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%" # Install the build and runtime dependencies of the project. - "pip install --trusted-host 28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com -r appveyor/requirements.txt" - "python setup.py bdist_wheel" - ps: "ls dist" # Install the genreated wheel package to test it - "pip install --pre --no-index --find-links dist/ joblib" # Not a .NET project, we build in the install step instead build: false test_script: # Change to a non-source folder to make sure we run the tests on the # installed library. - "cd C:\\" - "python -c \"import nose; nose.main()\" -v -s joblib" artifacts: # Archive the generated wheel package in the ci.appveyor.com build report. 
- path: dist\* #on_success: # - TODO: upload the content of dist/*.whl to a public wheelhouse joblib-0.9.4/appveyor/000077500000000000000000000000001264716474700146545ustar00rootroot00000000000000joblib-0.9.4/appveyor/install.ps1000066400000000000000000000135431264716474700167550ustar00rootroot00000000000000# Sample script to install Python and pip under Windows # Authors: Olivier Grisel, Jonathan Helmus and Kyle Kastner # License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/ $MINICONDA_URL = "http://repo.continuum.io/miniconda/" $BASE_URL = "https://www.python.org/ftp/python/" $GET_PIP_URL = "https://bootstrap.pypa.io/get-pip.py" $GET_PIP_PATH = "C:\get-pip.py" function DownloadPython ($python_version, $platform_suffix) { $webclient = New-Object System.Net.WebClient $filename = "python-" + $python_version + $platform_suffix + ".msi" $url = $BASE_URL + $python_version + "/" + $filename $basedir = $pwd.Path + "\" $filepath = $basedir + $filename if (Test-Path $filename) { Write-Host "Reusing" $filepath return $filepath } # Download and retry up to 3 times in case of network transient errors. Write-Host "Downloading" $filename "from" $url $retry_attempts = 2 for($i=0; $i -lt $retry_attempts; $i++){ try { $webclient.DownloadFile($url, $filepath) break } Catch [Exception]{ Start-Sleep 1 } } if (Test-Path $filepath) { Write-Host "File saved at" $filepath } else { # Retry once to get the error message if any at the last try $webclient.DownloadFile($url, $filepath) } return $filepath } function InstallPython ($python_version, $architecture, $python_home) { Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home if (Test-Path $python_home) { Write-Host $python_home "already exists, skipping." return $false } if ($architecture -eq "32") { $platform_suffix = "" } else { $platform_suffix = ".amd64" } $msipath = DownloadPython $python_version $platform_suffix Write-Host "Installing" $msipath "to" $python_home $install_log = $python_home + ".log" $install_args = "/qn /log $install_log /i $msipath TARGETDIR=$python_home" $uninstall_args = "/qn /x $msipath" RunCommand "msiexec.exe" $install_args if (-not(Test-Path $python_home)) { Write-Host "Python seems to be installed else-where, reinstalling." RunCommand "msiexec.exe" $uninstall_args RunCommand "msiexec.exe" $install_args } if (Test-Path $python_home) { Write-Host "Python $python_version ($architecture) installation complete" } else { Write-Host "Failed to install Python in $python_home" Get-Content -Path $install_log Exit 1 } } function RunCommand ($command, $command_args) { Write-Host $command $command_args Start-Process -FilePath $command -ArgumentList $command_args -Wait -Passthru } function InstallPip ($python_home) { $pip_path = $python_home + "\Scripts\pip.exe" $python_path = $python_home + "\python.exe" if (-not(Test-Path $pip_path)) { Write-Host "Installing pip..." $webclient = New-Object System.Net.WebClient $webclient.DownloadFile($GET_PIP_URL, $GET_PIP_PATH) Write-Host "Executing:" $python_path $GET_PIP_PATH Start-Process -FilePath "$python_path" -ArgumentList "$GET_PIP_PATH" -Wait -Passthru } else { Write-Host "pip already installed." 
} } function DownloadMiniconda ($python_version, $platform_suffix) { $webclient = New-Object System.Net.WebClient if ($python_version -eq "3.4") { $filename = "Miniconda3-3.5.5-Windows-" + $platform_suffix + ".exe" } else { $filename = "Miniconda-3.5.5-Windows-" + $platform_suffix + ".exe" } $url = $MINICONDA_URL + $filename $basedir = $pwd.Path + "\" $filepath = $basedir + $filename if (Test-Path $filename) { Write-Host "Reusing" $filepath return $filepath } # Download and retry up to 3 times in case of network transient errors. Write-Host "Downloading" $filename "from" $url $retry_attempts = 2 for($i=0; $i -lt $retry_attempts; $i++){ try { $webclient.DownloadFile($url, $filepath) break } Catch [Exception]{ Start-Sleep 1 } } if (Test-Path $filepath) { Write-Host "File saved at" $filepath } else { # Retry once to get the error message if any at the last try $webclient.DownloadFile($url, $filepath) } return $filepath } function InstallMiniconda ($python_version, $architecture, $python_home) { Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home if (Test-Path $python_home) { Write-Host $python_home "already exists, skipping." return $false } if ($architecture -eq "32") { $platform_suffix = "x86" } else { $platform_suffix = "x86_64" } $filepath = DownloadMiniconda $python_version $platform_suffix Write-Host "Installing" $filepath "to" $python_home $install_log = $python_home + ".log" $args = "/S /D=$python_home" Write-Host $filepath $args Start-Process -FilePath $filepath -ArgumentList $args -Wait -Passthru if (Test-Path $python_home) { Write-Host "Python $python_version ($architecture) installation complete" } else { Write-Host "Failed to install Python in $python_home" Get-Content -Path $install_log Exit 1 } } function InstallMinicondaPip ($python_home) { $pip_path = $python_home + "\Scripts\pip.exe" $conda_path = $python_home + "\Scripts\conda.exe" if (-not(Test-Path $pip_path)) { Write-Host "Installing pip..." $args = "install --yes pip" Write-Host $conda_path $args Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait -Passthru } else { Write-Host "pip already installed." } } function main () { InstallPython $env:PYTHON_VERSION $env:PYTHON_ARCH $env:PYTHON InstallPip $env:PYTHON } main joblib-0.9.4/appveyor/requirements.txt000066400000000000000000000003201264716474700201330ustar00rootroot00000000000000--find-links http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/ # force numpy version to use the wheel hosted on rackspace instead of # tarball from PyPI numpy==1.9.2 nose wheel joblib-0.9.4/benchmarks/000077500000000000000000000000001264716474700151245ustar00rootroot00000000000000joblib-0.9.4/benchmarks/bench_auto_batching.py000066400000000000000000000102401264716474700214410ustar00rootroot00000000000000"""Benchmark batching="auto" on high number of fast tasks The goal of this script is to study the behavior of the batch_size='auto' and in particular the impact of the default value of the joblib.parallel.MIN_IDEAL_BATCH_DURATION constant. """ # Author: Olivier Grisel # License: BSD 3 clause import numpy as np import time import tempfile from pprint import pprint from joblib import parallel, Parallel, delayed def sleep_noop(duration, input_data, output_data_size): """Noop function to emulate real computation. Simulate CPU time with by sleeping duration. Induce overhead by accepting (and ignoring) any amount of data as input and allocating a requested amount of data. 
""" time.sleep(duration) if output_data_size: return np.ones(output_data_size, dtype=np.byte) def bench_short_tasks(task_times, n_jobs=2, batch_size="auto", pre_dispatch="2*n_jobs", verbose=True, input_data_size=0, output_data_size=0, backend=None, memmap_input=False): with tempfile.NamedTemporaryFile() as temp_file: if input_data_size: # Generate some input data with the required size if memmap_input: temp_file.close() input_data = np.memmap(temp_file.name, shape=input_data_size, dtype=np.byte, mode='w+') input_data[:] = 1 else: input_data = np.ones(input_data_size, dtype=np.byte) else: input_data = None t0 = time.time() p = Parallel(n_jobs=n_jobs, verbose=verbose, pre_dispatch=pre_dispatch, batch_size=batch_size, backend=backend) p(delayed(sleep_noop)(max(t, 0), input_data, output_data_size) for t in task_times) duration = time.time() - t0 effective_batch_size = getattr(p, '_effective_batch_size', p.batch_size) print('Completed %d tasks in %0.3fs, final batch_size=%d\n' % (len(task_times), duration, effective_batch_size)) return duration, effective_batch_size if __name__ == "__main__": bench_parameters = dict( # batch_size=200, # batch_size='auto' by default # memmap_input=True, # if True manually memmap input out of timing # backend='threading', # backend='multiprocessing' by default # pre_dispatch='n_jobs', # pre_dispatch="2*n_jobs" by default input_data_size=int(2e7), # input data size in bytes output_data_size=int(1e5), # output data size in bytes n_jobs=2, verbose=10, ) print("Common benchmark parameters:") pprint(bench_parameters) parallel.MIN_IDEAL_BATCH_DURATION = 0.2 parallel.MAX_IDEAL_BATCH_DURATION = 2 # First pair of benchmarks to check that the auto-batching strategy is # stable (do not change the batch size too often) in the presence of large # variance while still be comparable to the equivalent load without # variance print('# high variance, no trend') # censored gaussian distribution high_variance = np.random.normal(loc=0.000001, scale=0.001, size=5000) high_variance[high_variance < 0] = 0 bench_short_tasks(high_variance, **bench_parameters) print('# low variance, no trend') low_variance = np.empty_like(high_variance) low_variance[:] = np.mean(high_variance) bench_short_tasks(low_variance, **bench_parameters) # Second pair of benchmarks: one has a cycling task duration pattern that # the auto batching feature should be able to roughly track. We use an even # power of cos to get only positive task durations with a majority close to # zero (only data transfer overhead). The shuffle variant should not # oscillate too much and still approximately have the same total run time. print('# cyclic trend') slow_time = 0.1 positive_wave = np.cos(np.linspace(1, 4 * np.pi, 300)) ** 8 cyclic = positive_wave * slow_time bench_short_tasks(cyclic, **bench_parameters) print("shuffling of the previous benchmark: same mean and variance") np.random.shuffle(cyclic) bench_short_tasks(cyclic, **bench_parameters) joblib-0.9.4/continuous_integration/000077500000000000000000000000001264716474700176205ustar00rootroot00000000000000joblib-0.9.4/continuous_integration/install.sh000077500000000000000000000055511264716474700216330ustar00rootroot00000000000000#!/bin/bash # This script is meant to be called by the "install" step defined in # .travis.yml. See http://docs.travis-ci.com/ for more details. # The behavior of the script is controlled by environment variabled defined # in the .travis.yml in the top level folder of the project. 
# # This script is adapted from a similar script from the scikit-learn repository. # # License: 3-clause BSD set -e print_conda_requirements() { # Echo a conda requirement string for example # "pip nose python='.7.3 scikit-learn=*". It has a hardcoded # list of possible packages to install and looks at _VERSION # environment variables to know whether to install a given package and # if yes which version to install. For example: # - for numpy, NUMPY_VERSION is used # - for scikit-learn, SCIKIT_LEARN_VERSION is used TO_INSTALL_ALWAYS="pip nose" REQUIREMENTS="$TO_INSTALL_ALWAYS" TO_INSTALL_MAYBE="python numpy" for PACKAGE in $TO_INSTALL_MAYBE; do # Capitalize package name and add _VERSION PACKAGE_VERSION_VARNAME="${PACKAGE^^}_VERSION" # replace - by _, needed for scikit-learn for example PACKAGE_VERSION_VARNAME="${PACKAGE_VERSION_VARNAME//-/_}" # dereference $PACKAGE_VERSION_VARNAME to figure out the # version to install PACKAGE_VERSION="${!PACKAGE_VERSION_VARNAME}" if [ -n "$PACKAGE_VERSION" ]; then REQUIREMENTS="$REQUIREMENTS $PACKAGE=$PACKAGE_VERSION" fi done echo $REQUIREMENTS } create_new_conda_env() { # Deactivate the travis-provided virtual environment and setup a # conda-based environment instead deactivate # Use the miniconda installer for faster download / install of conda # itself wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh \ -O miniconda.sh chmod +x miniconda.sh && ./miniconda.sh -b export PATH=/home/travis/miniconda2/bin:$PATH conda update --yes conda # Configure the conda environment and put it in the path using the # provided versions REQUIREMENTS=$(print_conda_requirements) echo "conda requirements string: $REQUIREMENTS" conda create -n testenv --yes $REQUIREMENTS source activate testenv if [[ "$INSTALL_MKL" == "true" ]]; then # Make sure that MKL is used conda install --yes mkl else # Make sure that MKL is not used conda remove --yes --features mkl || echo "MKL not installed" fi } create_new_conda_env if [ -z "$NUMPY_VERSION" ]; then # We want to disable doctests because they need numpy to run. I # could not find a way to override the with-doctest value in # setup.cfg so doing it the hacky way ... cat setup.cfg | grep -v 'with-doctest=' > setup.cfg.new mv setup.cfg{.new,} fi if [[ "$COVERAGE" == "true" ]]; then pip install coverage coveralls fi python setup.py install joblib-0.9.4/doc/000077500000000000000000000000001264716474700135545ustar00rootroot00000000000000joblib-0.9.4/doc/__init__.py000066400000000000000000000001351264716474700156640ustar00rootroot00000000000000""" This is a phony __init__.py file, so that nose finds the doctests in this directory. """ joblib-0.9.4/doc/_templates/000077500000000000000000000000001264716474700157115ustar00rootroot00000000000000joblib-0.9.4/doc/_templates/layout.html000066400000000000000000000012501264716474700201120ustar00rootroot00000000000000{% extends '!layout.html' %} {%- if pagename == 'index' %} {% set title = 'Joblib: running Python functions as pipeline jobs' %} {%- endif %} {%- block sidebarsourcelink %} {% endblock %} {%- block sidebarsearch %}
{{ super() }}

Mailing list

joblib@librelist.com

Send an email to subscribe

{%- if show_source and has_source and sourcename %}
{{ _('Show this page source') }} {%- endif %} {% endblock %} joblib-0.9.4/doc/conf.py000066400000000000000000000163431264716474700150620ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # joblib documentation build configuration file, created by # sphinx-quickstart on Thu Oct 23 16:36:51 2008. # # This file is execfile()d with the current directory set to its # containing dir. # # The contents of this file are pickled, so don't put values in the # namespace that aren't pickleable (module imports are okay, # they're removed automatically). # # All configuration values have a default; values that are commented out # serve to show the default. import sys import os import joblib # If your extensions are in another directory, add it here. If the directory # is relative to the documentation root, use os.path.abspath to make it # absolute, like shown here. #sys.path.append(os.path.abspath('.')) sys.path.append(os.path.abspath('./sphinxext')) # General configuration # --------------------- # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = ['sphinx.ext.autodoc', 'sphinx.ext.pngmath', 'numpydoc', 'phantom_import', 'sphinx.ext.autosummary', 'sphinx.ext.coverage'] autosummary_generate = True # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8' # The master toctree document. master_doc = 'index' # General information about the project. project = 'joblib' copyright = '2008-2009, Gael Varoquaux' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = joblib.__version__ # The full version, including alpha/beta/rc tags. release = version # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of documents that shouldn't be included in the build. #unused_docs = [] # List of directories, relative to source directory, that shouldn't be searched # for source files. exclude_trees = [] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. #add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # Avoid '+DOCTEST...' comments in the docs trim_doctest_flags = True # Options for HTML output # ----------------------- # The style sheet to use for HTML and HTML Help pages. A file of that name # must exist either in Sphinx' static/ path, or in one of the custom paths # given in html_static_path. #html_style = 'default.css' # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". 
#html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_use_modindex = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, the reST sources are included in the HTML build as _sources/. #html_copy_source = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = '' # Output file base name for HTML help builder. htmlhelp_basename = 'joblibdoc' # Options for LaTeX output # ------------------------ # The paper size ('letter' or 'a4'). #latex_paper_size = 'letter' # The font size ('10pt', '11pt' or '12pt'). #latex_font_size = '10pt' # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, # document class [howto/manual]). latex_documents = [ ('index', 'joblib.tex', 'joblib Documentation', 'Gael Varoquaux', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # Additional stuff for the LaTeX preamble. #latex_preamble = '' # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_use_modindex = True # default is used to be compatible with both sphinx 1.2.3 and sphinx # 1.3.1. 
If we want to support only 1.3.1 'classic' can be used # instead html_theme = 'default' html_theme_options = { # "bgcolor": "#fff", # "footertextcolor": "#666", "relbarbgcolor": "#333", # "relbarlinkcolor": "#445481", # "relbartextcolor": "#445481", "sidebarlinkcolor": "#e15617", "sidebarbgcolor": "#000", # "sidebartextcolor": "#333", "footerbgcolor": "#111", "linkcolor": "#aa560c", # "bodyfont": '"Lucida Grande",Verdana,Lucida,Helvetica,Arial,sans-serif', # "headfont": "georgia, 'bitstream vera sans serif', 'lucida grande', # helvetica, verdana, sans-serif", # "headbgcolor": "#F5F5F5", "headtextcolor": "#643200", "codebgcolor": "#f5efe7", } ############################################################################## # Hack to copy the CHANGES.rst file import shutil try: shutil.copyfile('../CHANGES.rst', 'CHANGES.rst') shutil.copyfile('../README.rst', 'README.rst') except IOError: pass # This fails during the tesing, as the code is ran in a different # directory joblib-0.9.4/doc/developing.rst000066400000000000000000000001411264716474700164360ustar00rootroot00000000000000 =============== Development =============== .. include:: README.rst .. include:: CHANGES.rst joblib-0.9.4/doc/index.rst000066400000000000000000000016111264716474700154140ustar00rootroot00000000000000.. raw:: html .. raw:: html

Joblib: running Python functions as pipeline jobs

Introduction ------------ .. automodule:: joblib User manual -------------- .. toctree:: :maxdepth: 2 why.rst installing.rst memory.rst parallel.rst persistence.rst developing.rst Module reference ----------------- .. currentmodule:: joblib .. autosummary:: :toctree: generated Memory Parallel dump load hash joblib-0.9.4/doc/installing.rst000066400000000000000000000042171264716474700164560ustar00rootroot00000000000000Installing joblib =================== The `easy_install` way ----------------------- For the easiest way to install joblib you need to have `setuptools` installed. * For installing for all users, you need to run:: easy_install joblib You may need to run the above command as administrator On a unix environment, it is better to install outside of the hierarchy managed by the system:: easy_install --prefix /usr/local joblib * Installing only for a specific user is easy if you use Python 2.6 or above:: easy_install --user joblib .. warning:: Packages installed via `easy_install` override the Python module look up mechanism and thus can confused people not familiar with setuptools. Although it may seem harder, we suggest that you use the manual way, as described in the following paragraph. Using distributions -------------------- Joblib is packaged for several linux distribution: archlinux, debian, ubuntu, altlinux, and fedora. For minimum administration overhead, using the package manager is the recommended installation strategy on these systems. The manual way --------------- To install joblib first download the latest tarball (follow the link on the bottom of http://pypi.python.org/pypi/joblib) and expand it. Installing in a local environment .................................. If you don't need to install for all users, we strongly suggest that you create a local environment and install `joblib` in it. One of the pros of this method is that you never have to become administrator, and thus all the changes are local to your account and easy to clean up. Simply move to the directory created by expanding the `joblib` tarball and run the following command:: python setup.py install --user Installing for all users ........................ If you have administrator rights and want to install for all users, all you need to do is to go in directory created by expanding the `joblib` tarball and run the following line:: python setup.py install If you are under Unix, we suggest that you install in '/usr/local' in order not to interfere with your system:: python setup.py install --prefix /usr/local joblib-0.9.4/doc/memory.rst000066400000000000000000000273511264716474700156260ustar00rootroot00000000000000.. For doctests: >>> from joblib.testing import warnings_to_stdout >>> warnings_to_stdout() .. _memory: =========================================== On demand recomputing: the `Memory` class =========================================== .. currentmodule:: joblib.memory Usecase -------- The `Memory` class defines a context for lazy evaluation of function, by storing the results to the disk, and not rerunning the function twice for the same arguments. .. Commented out in favor of briefness You can use it as a context, with its `eval` method: .. automethod:: Memory.eval or decorate functions with the `cache` method: .. automethod:: Memory.cache It works by explicitly saving the output to a file and it is designed to work with non-hashable and potentially large input and output data types such as numpy arrays. 
A simple example: ~~~~~~~~~~~~~~~~~ First we create a temporary directory, for the cache:: >>> from tempfile import mkdtemp >>> cachedir = mkdtemp() We can instantiate a memory context, using this cache directory:: >>> from joblib import Memory >>> memory = Memory(cachedir=cachedir, verbose=0) Then we can decorate a function to be cached in this context:: >>> @memory.cache ... def f(x): ... print('Running f(%s)' % x) ... return x When we call this function twice with the same argument, it does not get executed the second time, and the output gets loaded from the pickle file:: >>> print(f(1)) Running f(1) 1 >>> print(f(1)) 1 However, when we call it a third time, with a different argument, the output gets recomputed:: >>> print(f(2)) Running f(2) 2 Comparison with `memoize` ~~~~~~~~~~~~~~~~~~~~~~~~~ The `memoize` decorator (http://code.activestate.com/recipes/52201/) caches in memory all the inputs and outputs of a function call. It can thus avoid running twice the same function, with a very small overhead. However, it compares input objects with those in cache on each call. As a result, for big objects there is a huge overhead. Moreover this approach does not work with numpy arrays, or other objects subject to non-significant fluctuations. Finally, using `memoize` with large objects will consume all the memory, where with `Memory`, objects are persisted to disk, using a persister optimized for speed and memory usage (:func:`joblib.dump`). In short, `memoize` is best suited for functions with "small" input and output objects, whereas `Memory` is best suited for functions with complex input and output objects, and aggressive persistence to disk. Using with `numpy` ------------------- The original motivation behind the `Memory` context was to be able to a memoize-like pattern on numpy arrays. `Memory` uses fast cryptographic hashing of the input arguments to check if they have been computed; An example ~~~~~~~~~~~ We define two functions, the first with a number as an argument, outputting an array, used by the second one. We decorate both functions with `Memory.cache`:: >>> import numpy as np >>> @memory.cache ... def g(x): ... print('A long-running calculation, with parameter %s' % x) ... return np.hamming(x) >>> @memory.cache ... def h(x): ... print('A second long-running calculation, using g(x)') ... return np.vander(x) If we call the function h with the array created by the same call to g, h is not re-run:: >>> a = g(3) A long-running calculation, with parameter 3 >>> a array([ 0.08, 1. , 0.08]) >>> g(3) array([ 0.08, 1. , 0.08]) >>> b = h(a) A second long-running calculation, using g(x) >>> b2 = h(a) >>> b2 array([[ 0.0064, 0.08 , 1. ], [ 1. , 1. , 1. ], [ 0.0064, 0.08 , 1. ]]) >>> np.allclose(b, b2) True Using memmapping ~~~~~~~~~~~~~~~~ To speed up cache looking of large numpy arrays, you can load them using memmapping (memory mapping):: >>> cachedir2 = mkdtemp() >>> memory2 = Memory(cachedir=cachedir2, mmap_mode='r') >>> square = memory2.cache(np.square) >>> a = np.vander(np.arange(3)).astype(np.float) >>> square(a) ________________________________________________________________________________ [Memory] Calling square... square(array([[ 0., 0., 1.], [ 1., 1., 1.], [ 4., 2., 1.]])) ___________________________________________________________square - 0.0s, 0.0min memmap([[ 0., 0., 1.], [ 1., 1., 1.], [ 16., 4., 1.]]) .. note:: Notice the debug mode used in the above example. It is useful for tracing of what is being reexecuted, and where the time is spent. 
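The amount of tracing shown above is controlled by the ``verbose`` argument of
the :class:`Memory` constructor (the first example of this page passed
``verbose=0`` to stay silent). A minimal sketch, reusing the cache directory
created above (the variable names are only illustrative)::

    >>> quiet_memory = Memory(cachedir=cachedir2, mmap_mode='r', verbose=0)
    >>> quiet_square = quiet_memory.cache(np.square)
    >>> # With verbose=0 no "[Memory] Calling square..." banner is printed
    >>> quiet_square(a)  # doctest: +SKIP
    memmap([[  0.,   0.,   1.],
            [  1.,   1.,   1.],
            [ 16.,   4.,   1.]])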
If the `square` function is called with the same input argument, its return value is loaded from the disk using memmapping:: >>> res = square(a) >>> print(repr(res)) memmap([[ 0., 0., 1.], [ 1., 1., 1.], [ 16., 4., 1.]]) .. We need to close the memmap file to avoid file locking on Windows; closing numpy.memmap objects is done with del, which flushes changes to the disk >>> del res .. note:: If the memory mapping mode used was 'r', as in the above example, the array will be read only, and will be impossible to modified in place. On the other hand, using 'r+' or 'w+' will enable modification of the array, but will propagate these modification to the disk, which will corrupt the cache. If you want modification of the array in memory, we suggest you use the 'c' mode: copy on write. Shelving: using references to cached values ------------------------------------------- In some cases, it can be useful to get a reference to the cached result, instead of having the result itself. A typical example of this is when a lot of large numpy arrays must be dispatched accross several workers: instead of sending the data themselves over the network, send a reference to the joblib cache, and let the workers read the data from a network filesystem, potentially taking advantage of some system-level caching too. Getting a reference to the cache can be done using the `call_and_shelve` method on the wrapped function:: >>> result = g.call_and_shelve(4) A long-running calculation, with parameter 4 >>> result #doctest: +ELLIPSIS MemorizedResult(cachedir="...", func="g...", argument_hash="...") Once computed, the output of `g` is stored on disk, and deleted from memory. Reading the associated value can then be performed with the `get` method:: >>> result.get() array([ 0.08, 0.77, 0.77, 0.08]) The cache for this particular value can be cleared using the `clear` method. Its invocation causes the stored value to be erased from disk. Any subsequent call to `get` will cause a `KeyError` exception to be raised:: >>> result.clear() >>> result.get() #doctest: +ELLIPSIS Traceback (most recent call last): ... KeyError: 'Non-existing cache value (may have been cleared).\nFile ... does not exist' A `MemorizedResult` instance contains all that is necessary to read the cached value. It can be pickled for transmission or storage, and the printed representation can even be copy-pasted to a different python interpreter. .. topic:: Shelving when cache is disabled In the case where caching is disabled (e.g. `Memory(cachedir=None)`), the `call_and_shelve` method returns a `NotMemorizedResult` instance, that stores the full function output, instead of just a reference (since there is nothing to point to). All the above remains valid though, except for the copy-pasting feature. Gotchas -------- * **Across sessions, function cache is identified by the function's name**. Thus if you assign the same name to different functions, their cache will override each-others (you have 'name collisions'), and you will get unwanted re-run:: >>> @memory.cache ... def func(x): ... print('Running func(%s)' % x) >>> func2 = func >>> @memory.cache ... def func(x): ... print('Running a different func(%s)' % x) As long as you stay in the same session, there are no collisions (in joblib 0.8 and above), altough joblib does warn you that you are doing something dangerous:: >>> func(1) Running a different func(1) >>> func2(1) #doctest: +ELLIPSIS memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (:...) and 'func' (:...) 
Running func(1) >>> func(1) # No recomputation so far >>> func2(1) # No recomputation so far .. Empty the in-memory cache to simulate exiting and reloading the interpreter >>> import joblib.memory >>> joblib.memory._FUNCTION_HASHES.clear() But suppose you exit the interpreter and restart it, the cache will not be identified properly, and the functions will be rerun:: >>> func(1) #doctest: +ELLIPSIS memory.rst:0: JobLibCollisionWarning: Possible name collisions between functions 'func' (:...) and 'func' (:...) Running a different func(1) >>> func2(1) #doctest: +ELLIPSIS Running func(1) As long as you stay in the same session, you are not getting needless recomputation:: >>> func(1) # No recomputation now >>> func2(1) # No recomputation now * **lambda functions** Beware that with Python 2.6 lambda functions cannot be separated out:: >>> def my_print(x): ... print(x) >>> f = memory.cache(lambda : my_print(1)) >>> g = memory.cache(lambda : my_print(2)) >>> f() 1 >>> f() >>> g() # doctest: +SKIP memory.rst:0: JobLibCollisionWarning: Cannot detect name collisions for function '' 2 >>> g() # doctest: +SKIP >>> f() # doctest: +SKIP 1 * **memory cannot be used on some complex objects**, e.g. a callable object with a `__call__` method. However, it works on numpy ufuncs:: >>> sin = memory.cache(np.sin) >>> print(sin(0)) 0.0 * **caching methods**: you cannot decorate a method at class definition, because when the class is instantiated, the first argument (self) is *bound*, and no longer accessible to the `Memory` object. The following code won't work:: class Foo(object): @mem.cache # WRONG def method(self, args): pass The right way to do this is to decorate at instantiation time:: class Foo(object): def __init__(self, args): self.method = mem.cache(self.method) def method(self, ...): pass Ignoring some arguments ------------------------ It may be useful not to recalculate a function when certain arguments change, for instance a debug flag. `Memory` provides the `ignore` list:: >>> @memory.cache(ignore=['debug']) ... def my_func(x, debug=True): ... print('Called with x = %s' % x) >>> my_func(0) Called with x = 0 >>> my_func(0, debug=False) >>> my_func(0, debug=True) >>> # my_func was not reevaluated .. _memory_reference: Reference documentation of the `Memory` class ---------------------------------------------- .. autoclass:: Memory :members: __init__, cache, eval, clear Useful methods of decorated functions -------------------------------------- Function decorated by :meth:`Memory.cache` are :class:`MemorizedFunc` objects that, in addition of behaving like normal functions, expose methods useful for cache exploration and management. .. autoclass:: MemorizedFunc :members: __init__, call, clear, format_signature, format_call, get_output_dir, load_output .. Let us not forget to clean our cache dir once we are finished:: >>> import shutil >>> try: ... shutil.rmtree(cachedir) ... shutil.rmtree(cachedir2) ... except OSError: ... pass # this can sometimes fail under Windows joblib-0.9.4/doc/parallel.rst000066400000000000000000000121421264716474700161020ustar00rootroot00000000000000 ================================= Embarrassingly parallel for loops ================================= Common usage ============ Joblib provides a simple helper class to write parallel for loops using multiprocessing. 
The core idea is to write the code to be executed as a generator expression,
and convert it to parallel computing::

    >>> from math import sqrt
    >>> [sqrt(i ** 2) for i in range(10)]
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

can be spread over 2 CPUs using the following::

    >>> from math import sqrt
    >>> from joblib import Parallel, delayed
    >>> Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(10))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

Under the hood, the :class:`Parallel` object creates a multiprocessing `pool`
that forks the Python interpreter in multiple processes to execute each of
the items of the list. The `delayed` function is a simple trick to be able
to create a tuple `(function, args, kwargs)` with a function-call syntax.

.. warning::

   Under Windows, it is important to protect the main loop of code to avoid
   recursive spawning of subprocesses when using joblib.Parallel. In other
   words, you should be writing code like this:

   .. code-block:: python

      import ....

      def function1(...):
          ...

      def function2(...):
          ...

      ...
      if __name__ == '__main__':
          # do stuff with imports and functions defined above
          ...

   **No** code should *run* outside of the "if __name__ == '__main__'"
   blocks, only imports and definitions.

Using the threading backend
===========================

By default :class:`Parallel` uses the Python ``multiprocessing`` module to
fork separate Python worker processes to execute tasks concurrently on
separate CPUs. This is a reasonable default for generic Python programs,
but it induces some overhead as the input and output data need to be
serialized in a queue for communication with the worker processes.

If you know that the function you are calling is based on a compiled
extension that releases the Python Global Interpreter Lock (GIL) during
most of its computation, then it might be more efficient to use threads
instead of Python processes as concurrent workers. For instance, this is
the case if you write the CPU-intensive part of your code inside a
`with nogil`_ block of a Cython function.

To use threads, just pass ``"threading"`` as the value of the ``backend``
parameter of the :class:`Parallel` constructor:

    >>> Parallel(n_jobs=2, backend="threading")(
    ...     delayed(sqrt)(i ** 2) for i in range(10))
    [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

.. _`with nogil`: http://docs.cython.org/src/userguide/external_C_code.html#acquiring-and-releasing-the-gil

Reusing a pool of workers
=========================

Some algorithms require making several consecutive calls to a parallel
function interleaved with processing of the intermediate results. Calling
``Parallel`` several times in a loop is sub-optimal because it will create
and destroy a pool of workers (threads or processes) several times, which
can cause significant overhead.

For this case it is more efficient to use the context manager API of the
``Parallel`` class to reuse the same pool of workers for several calls to
the ``Parallel`` object::

    >>> with Parallel(n_jobs=2) as parallel:
    ...     accumulator = 0.
    ...     n_iter = 0
    ...     while accumulator < 1000:
    ...         results = parallel(delayed(sqrt)(accumulator + i ** 2)
    ...                            for i in range(5))
    ...         accumulator += sum(results)  # synchronization barrier
    ...         n_iter += 1
    ...
    >>> (accumulator, n_iter)  # doctest: +ELLIPSIS
    (1136.596..., 14)

..
include:: parallel_numpy.rst

Bad interaction of multiprocessing and third-party libraries
============================================================

Prior to Python 3.4 the ``'multiprocessing'`` backend of joblib can only use
the ``fork`` strategy to create worker processes under non-Windows systems.
This can cause some third-party libraries to crash or freeze. Such libraries
include Apple vecLib / Accelerate (used by NumPy under OSX), some old
versions of OpenBLAS (prior to 0.2.10) and the OpenMP runtime implementation
from GCC.

To avoid this problem ``joblib.Parallel`` can be configured to use the
``'forkserver'`` start method on Python 3.4 and later. The start method has
to be configured by setting the ``JOBLIB_START_METHOD`` environment variable
to ``'forkserver'`` instead of the default ``'fork'`` start method. However
the user should be aware that using ``'forkserver'`` prevents
``joblib.Parallel`` from calling functions interactively defined in a shell
session.

You can read more on this topic in the `multiprocessing documentation `_.

Under Windows the ``fork`` system call does not exist at all, so this problem
does not arise (but multiprocessing has more overhead).

`Parallel` reference documentation
==================================

.. autoclass:: joblib.Parallel
   :members: auto
joblib-0.9.4/doc/parallel_numpy.rst000066400000000000000000000132511264716474700173340ustar00rootroot00000000000000
Working with numerical data in shared memory (memmaping)
========================================================

By default the workers of the pool are real Python processes forked using the
``multiprocessing`` module of the Python standard library when
``n_jobs != 1``. The arguments passed as input to the ``Parallel`` call are
serialized and reallocated in the memory of each worker process.

This can be problematic for large arguments as they will be reallocated
``n_jobs`` times by the workers.

As this problem can often occur in scientific computing with ``numpy`` based
datastructures, :class:`joblib.Parallel` provides special handling for large
arrays to automatically dump them on the filesystem and pass a reference to
the worker to open them as a memory map on that file, using the
``numpy.memmap`` subclass of ``numpy.ndarray``. This makes it possible to
share a segment of data between all the worker processes.

.. note::

   The following only applies with the default ``"multiprocessing"`` backend.
   If your code can release the GIL, then using ``backend="threading"`` is
   even more efficient.

Automated array to memmap conversion
------------------------------------

The automated array to memmap conversion is triggered by a configurable
threshold on the size of the array::

    >>> import numpy as np
    >>> from joblib import Parallel, delayed
    >>> from joblib.pool import has_shareable_memory

    >>> Parallel(n_jobs=2, max_nbytes=1e6)(
    ...     delayed(has_shareable_memory)(np.ones(int(i)))
    ...     for i in [1e2, 1e4, 1e6])
    [False, False, True]

By default the data is dumped to the ``/dev/shm`` shared-memory partition if
it exists and is writable (typically the case under Linux). Otherwise the
operating system's temporary folder is used. The location of the temporary
data files can be customized by passing a ``temp_folder`` argument to the
``Parallel`` constructor.

Passing ``max_nbytes=None`` makes it possible to disable the automated array
to memmap conversion.
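Both knobs can be combined. As a minimal sketch (the temporary directory
created with ``tempfile.mkdtemp`` below is only for illustration, any
writable folder would do)::

    >>> import tempfile, shutil
    >>> import numpy as np
    >>> from joblib import Parallel, delayed
    >>> from joblib.pool import has_shareable_memory

    >>> custom_folder = tempfile.mkdtemp()
    >>> Parallel(n_jobs=2, max_nbytes=1e6, temp_folder=custom_folder)(
    ...     delayed(has_shareable_memory)(np.ones(int(i)))
    ...     for i in [1e2, 1e4, 1e6])
    [False, False, True]
    >>> shutil.rmtree(custom_folder)  # clean up the illustration folder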
Manual management of memmaped input data ---------------------------------------- For even finer tuning of the memory usage it is also possible to dump the array as an memmap directly from the parent process to free the memory before forking the worker processes. For instance let's allocate a large array in the memory of the parent process:: >>> large_array = np.ones(int(1e6)) Dump it to a local file for memmaping:: >>> import tempfile >>> import os >>> from joblib import load, dump >>> temp_folder = tempfile.mkdtemp() >>> filename = os.path.join(temp_folder, 'joblib_test.mmap') >>> if os.path.exists(filename): os.unlink(filename) >>> _ = dump(large_array, filename) >>> large_memmap = load(filename, mmap_mode='r+') The ``large_memmap`` variable is pointing to a ``numpy.memmap`` instance:: >>> large_memmap.__class__.__name__, large_array.nbytes, large_array.shape ('memmap', 8000000, (1000000,)) >>> np.allclose(large_array, large_memmap) True We can free the original array from the main process memory:: >>> del large_array >>> import gc >>> _ = gc.collect() It is possible to slice ``large_memmap`` into a smaller memmap:: >>> small_memmap = large_memmap[2:5] >>> small_memmap.__class__.__name__, small_memmap.nbytes, small_memmap.shape ('memmap', 24, (3,)) Finally we can also take a ``np.ndarray`` view backed on that same memory mapped file:: >>> small_array = np.asarray(small_memmap) >>> small_array.__class__.__name__, small_array.nbytes, small_array.shape ('ndarray', 24, (3,)) All those three datastructures point to the same memory buffer and this same buffer will also be reused directly by the worker processes of a ``Parallel`` call:: >>> Parallel(n_jobs=2, max_nbytes=None)( ... delayed(has_shareable_memory)(a) ... for a in [large_memmap, small_memmap, small_array]) [True, True, True] Note that here we used ``max_nbytes=None`` to disable the auto-dumping feature of ``Parallel``. ``small_array`` is still in shared memory in the worker processes because it was already backed by shared memory in the parent process. The pickling machinery of ``Parallel`` multiprocessing queues are able to detect this situation and optimize it on the fly to limit the number of memory copies. Writing parallel computation results in shared memory ----------------------------------------------------- If you open your data using the ``w+`` or ``r+`` mode in the main program, the worker will get ``r+`` mode access. Thus the worker will be able to write its results directly to the original data, alleviating the need of the serialization to send back the results to the parent process. Here is an example script on parallel processing with preallocated ``numpy.memmap`` datastructures: .. literalinclude:: ../examples/parallel_memmap.py :language: python :linenos: .. warning:: Having concurrent workers write on overlapping shared memory data segments, for instance by using inplace operators and assignments on a `numpy.memmap` instance, can lead to data corruption as numpy does not offer atomic operations. The previous example does not risk that issue as each task is updating an exclusive segment of the shared result array. Some C/C++ compilers offer lock-free atomic primitives such as add-and-fetch or compare-and-swap that could be exposed to Python via CFFI_ for instance. However providing numpy-aware atomic constructs is outside of the scope of the joblib project. .. 
_CFFI: https://cffi.readthedocs.org A final note: don't forget to clean up any temporary folder when you are done with the computation:: >>> import shutil >>> try: ... shutil.rmtree(temp_folder) ... except OSError: ... pass # this can sometimes fail under Windows joblib-0.9.4/doc/parallel_numpy_fixture.py000066400000000000000000000010731264716474700207210ustar00rootroot00000000000000"""Fixture module to skip memmaping test if numpy is not installed""" from nose import SkipTest from joblib.parallel import mp from joblib.test.common import setup_autokill from joblib.test.common import teardown_autokill def setup_module(module): numpy = None try: import numpy except ImportError: pass if numpy is None or mp is None: raise SkipTest('Skipped as numpy or multiprocessing is not available') setup_autokill(module.__name__, timeout=300) def teardown_module(module): teardown_autokill(module.__name__) joblib-0.9.4/doc/persistence.rst000066400000000000000000000077311264716474700166420ustar00rootroot00000000000000.. For doctests: >>> from joblib.testing import warnings_to_stdout >>> warnings_to_stdout() .. _persistence: =========== Persistence =========== .. currentmodule:: joblib.numpy_pickle Usecase ======= :func:`joblib.dump` and :func:`joblib.load` provide a replacement for pickle to work efficiently on Python objects containing large data, in particular large numpy arrays. A simple example ================ First we create a temporary directory:: >>> from tempfile import mkdtemp >>> savedir = mkdtemp() >>> import os >>> filename = os.path.join(savedir, 'test.pkl') Then we create an object to be persisted:: >>> import numpy as np >>> to_persist = [('a', [1, 2, 3]), ('b', np.arange(10))] which we save into `savedir`:: >>> import joblib >>> joblib.dump(to_persist, filename) # doctest: +ELLIPSIS ['...test.pkl', '...test.pkl_01.npy'] We can then load the object from the file:: >>> joblib.load(filename) [('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))] .. note:: As you can see from the output, joblib pickle tends to be spread across multiple files. More precisely, on top of the main joblib pickle file (passed into the `joblib.dump` function), for each numpy array that the persisted object contains, an auxiliary .npy file with the binary data of the array will be created. When moving joblib pickle files around, you will need to remember to keep all these files together. Compressed joblib pickles ========================= Setting the `compress` argument to `True` in :func:`joblib.dump` will allow to save space on disk: >>> joblib.dump(to_persist, filename, compress=True) # doctest: +ELLIPSIS ['...test.pkl'] Another advantage is that it will create a single-file joblib pickle. More details can be found in the :func:`joblib.dump` and :func:`joblib.load` documentation. Compatibility across python versions ------------------------------------ Compatibility of joblib pickles across python versions is not fully supported. Note that, for a very restricted set of objects, this may appear to work when saving a pickle with python 2 and loading it with python 3 but relying on it is strongly discouraged. If you are switching between python versions, you will need to save a different joblib pickle for each python version. 
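One minimal way to sketch this (the version-tagged file naming below is just
an illustration, not a joblib convention) is to embed the interpreter version
in the pickle file name::

    >>> import sys
    >>> versioned_filename = os.path.join(
    ...     savedir, 'test_py%d%d.pkl' % sys.version_info[:2])
    >>> _ = joblib.dump(to_persist, versioned_filename, compress=True)
    >>> joblib.load(versioned_filename)
    [('a', [1, 2, 3]), ('b', array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))]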
Here are a few examples or exceptions: - Saving joblib pickle with python 2, trying to load it with python 3:: Traceback (most recent call last): File "/home/lesteve/dev/joblib/joblib/numpy_pickle.py", line 453, in load obj = unpickler.load() File "/home/lesteve/miniconda3/lib/python3.4/pickle.py", line 1038, in load dispatch[key[0]](self) File "/home/lesteve/miniconda3/lib/python3.4/pickle.py", line 1176, in load_binstring self.append(self._decode_string(data)) File "/home/lesteve/miniconda3/lib/python3.4/pickle.py", line 1158, in _decode_string return value.decode(self.encoding, self.errors) UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1024: ordinal not in range(128) Traceback (most recent call last): File "", line 1, in File "/home/lesteve/dev/joblib/joblib/numpy_pickle.py", line 462, in load raise new_exc ValueError: You may be trying to read with python 3 a joblib pickle generated with python 2. This is not feature supported by joblib. - Saving joblib pickle with python 3, trying to load it with python 2:: Traceback (most recent call last): File "", line 1, in File "joblib/numpy_pickle.py", line 453, in load obj = unpickler.load() File "/home/lesteve/miniconda3/envs/py27/lib/python2.7/pickle.py", line 858, in load dispatch[key](self) File "/home/lesteve/miniconda3/envs/py27/lib/python2.7/pickle.py", line 886, in load_proto raise ValueError, "unsupported pickle protocol: %d" % proto ValueError: unsupported pickle protocol: 3 joblib-0.9.4/doc/sphinxext/000077500000000000000000000000001264716474700156065ustar00rootroot00000000000000joblib-0.9.4/doc/sphinxext/LICENSE.txt000066400000000000000000000027441264716474700174400ustar00rootroot00000000000000------------------------------------------------------------------------------- The files - numpydoc.py - docscrape.py - docscrape_sphinx.py - phantom_import.py have the following license: Copyright (C) 2008 Stefan van der Walt , Pauli Virtanen Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. joblib-0.9.4/doc/sphinxext/__init__.py000066400000000000000000000000001264716474700177050ustar00rootroot00000000000000joblib-0.9.4/doc/sphinxext/docscrape.py000066400000000000000000000346211264716474700201310ustar00rootroot00000000000000"""Extract reference documentation from the NumPy source tree. 
""" import inspect import textwrap import re import pydoc from warnings import warn class Reader(object): """A line-based string reader. """ def __init__(self, data): """ Parameters ---------- data : str String with lines separated by '\n'. """ if isinstance(data, list): self._str = data else: self._str = data.split('\n') # store string as list of lines self.reset() def __getitem__(self, n): return self._str[n] def reset(self): self._l = 0 # current line nr def read(self): if not self.eof(): out = self[self._l] self._l += 1 return out else: return '' def seek_next_non_empty_line(self): for l in self[self._l:]: if l.strip(): break else: self._l += 1 def eof(self): return self._l >= len(self._str) def read_to_condition(self, condition_func): start = self._l for line in self[start:]: if condition_func(line): return self[start:self._l] self._l += 1 if self.eof(): return self[start:self._l + 1] return [] def read_to_next_empty_line(self): self.seek_next_non_empty_line() def is_empty(line): return not line.strip() return self.read_to_condition(is_empty) def read_to_next_unindented_line(self): def is_unindented(line): return (line.strip() and (len(line.lstrip()) == len(line))) return self.read_to_condition(is_unindented) def peek(self, n=0): if self._l + n < len(self._str): return self[self._l + n] else: return '' def is_empty(self): return not ''.join(self._str).strip() class NumpyDocString(object): def __init__(self, docstring): docstring = textwrap.dedent(docstring).split('\n') self._doc = Reader(docstring) self._parsed_data = { 'Signature': '', 'Summary': [''], 'Extended Summary': [], 'Parameters': [], 'Returns': [], 'Raises': [], 'Warns': [], 'Other Parameters': [], 'Attributes': [], 'Methods': [], 'See Also': [], 'Notes': [], 'Warnings': [], 'References': '', 'Examples': '', 'index': {} } self._parse() def __getitem__(self, key): return self._parsed_data[key] def __setitem__(self, key, val): if not self._parsed_data.has_key(key): warn("Unknown section %s" % key) else: self._parsed_data[key] = val def _is_at_section(self): self._doc.seek_next_non_empty_line() if self._doc.eof(): return False l1 = self._doc.peek().strip() # e.g. Parameters if l1.startswith('.. 
index::'): return True l2 = self._doc.peek(1).strip() # ---------- or ========== return l2.startswith('-' * len(l1)) or l2.startswith('=' * len(l1)) def _strip(self, doc): i = 0 j = 0 for i, line in enumerate(doc): if line.strip(): break for j, line in enumerate(doc[::-1]): if line.strip(): break return doc[i:len(doc) - j] def _read_to_next_section(self): section = self._doc.read_to_next_empty_line() while not self._is_at_section() and not self._doc.eof(): if not self._doc.peek(-1).strip(): # previous line was empty section += [''] section += self._doc.read_to_next_empty_line() return section def _read_sections(self): while not self._doc.eof(): data = self._read_to_next_section() name = data[0].strip() if name.startswith('..'): # index section yield name, data[1:] elif len(data) < 2: yield StopIteration else: yield name, self._strip(data[2:]) def _parse_param_list(self, content): r = Reader(content) params = [] while not r.eof(): header = r.read().strip() if ' : ' in header: arg_name, arg_type = header.split(' : ')[:2] else: arg_name, arg_type = header, '' desc = r.read_to_next_unindented_line() desc = dedent_lines(desc) params.append((arg_name, arg_type, desc)) return params _name_rgx = re.compile(r"^\s*(:(?P\w+):`(?P[a-zA-Z0-9_.-]+)`|" r" (?P[a-zA-Z0-9_.-]+))\s*", re.X) def _parse_see_also(self, content): """ func_name : Descriptive text continued text another_func_name : Descriptive text func_name1, func_name2, :meth:`func_name`, func_name3 """ items = [] def parse_item_name(text): """Match ':role:`name`' or 'name'""" m = self._name_rgx.match(text) if m: g = m.groups() if g[1] is None: return g[3], None else: return g[2], g[1] raise ValueError("%s is not a item name" % text) def push_item(name, rest): if not name: return name, role = parse_item_name(name) items.append((name, list(rest), role)) del rest[:] current_func = None rest = [] for line in content: if not line.strip(): continue m = self._name_rgx.match(line) if m and line[m.end():].strip().startswith(':'): push_item(current_func, rest) current_func, line = line[:m.end()], line[m.end():] rest = [line.split(':', 1)[1].strip()] if not rest[0]: rest = [] elif not line.startswith(' '): push_item(current_func, rest) current_func = None if ',' in line: for func in line.split(','): push_item(func, []) elif line.strip(): current_func = line elif current_func is not None: rest.append(line.strip()) push_item(current_func, rest) return items def _parse_index(self, section, content): """ .. 
index: default :refguide: something, else, and more """ def strip_each_in(lst): return [s.strip() for s in lst] out = {} section = section.split('::') if len(section) > 1: out['default'] = strip_each_in(section[1].split(','))[0] for line in content: line = line.split(':') if len(line) > 2: out[line[1]] = strip_each_in(line[2].split(',')) return out def _parse_summary(self): """Grab signature (if given) and summary""" if self._is_at_section(): return summary = self._doc.read_to_next_empty_line() summary_str = " ".join([s.strip() for s in summary]).strip() if re.compile('^([\w., ]+=)?\s*[\w\.]+\(.*\)$').match(summary_str): self['Signature'] = summary_str if not self._is_at_section(): self['Summary'] = self._doc.read_to_next_empty_line() else: self['Summary'] = summary if not self._is_at_section(): self['Extended Summary'] = self._read_to_next_section() def _parse(self): self._doc.reset() self._parse_summary() for (section, content) in self._read_sections(): if not section.startswith('..'): section = ' '.join([s.capitalize() for s in section.split(' ') ]) if section in ('Parameters', 'Attributes', 'Methods', 'Returns', 'Raises', 'Warns'): self[section] = self._parse_param_list(content) elif section.startswith('.. index::'): self['index'] = self._parse_index(section, content) elif section == 'See Also': self['See Also'] = self._parse_see_also(content) else: self[section] = content # string conversion routines def _str_header(self, name, symbol='-'): return [name, len(name) * symbol] def _str_indent(self, doc, indent=4): out = [] for line in doc: out += [' ' * indent + line] return out def _str_signature(self): if self['Signature']: return [self['Signature'].replace('*', '\*')] + [''] else: return [''] def _str_summary(self): if self['Summary']: return self['Summary'] + [''] else: return [] def _str_extended_summary(self): if self['Extended Summary']: return self['Extended Summary'] + [''] else: return [] def _str_param_list(self, name): out = [] if self[name]: out += self._str_header(name) for param, param_type, desc in self[name]: out += ['%s : %s' % (param, param_type)] out += self._str_indent(desc) out += [''] return out def _str_section(self, name): out = [] if self[name]: out += self._str_header(name) out += self[name] out += [''] return out def _str_see_also(self, func_role): if not self['See Also']: return [] out = [] out += self._str_header("See Also") last_had_desc = True for func, desc, role in self['See Also']: if role: link = ':%s:`%s`' % (role, func) elif func_role: link = ':%s:`%s`' % (func_role, func) else: link = "`%s`_" % func if desc or last_had_desc: out += [''] out += [link] else: out[-1] += ", %s" % link if desc: out += self._str_indent([' '.join(desc)]) last_had_desc = True else: last_had_desc = False out += [''] return out def _str_index(self): idx = self['index'] out = [] out += ['.. 
index:: %s' % idx.get('default', '')] for section, references in idx.iteritems(): if section == 'default': continue out += [' :%s: %s' % (section, ', '.join(references))] return out def __str__(self, func_role=''): out = [] out += self._str_signature() out += self._str_summary() out += self._str_extended_summary() for param_list in ('Parameters', 'Returns', 'Raises'): out += self._str_param_list(param_list) out += self._str_section('Warnings') out += self._str_see_also(func_role) for s in ('Notes', 'References', 'Examples'): out += self._str_section(s) out += self._str_index() return '\n'.join(out) def indent(str, indent=4): indent_str = ' ' * indent if str is None: return indent_str lines = str.split('\n') return '\n'.join(indent_str + l for l in lines) def dedent_lines(lines): """Deindent a list of lines maximally""" return textwrap.dedent("\n".join(lines)).split("\n") def header(text, style='-'): return text + '\n' + style * len(text) + '\n' class FunctionDoc(NumpyDocString): def __init__(self, func, role='func'): self._f = func self._role = role # e.g. "func" or "meth" try: NumpyDocString.__init__(self, inspect.getdoc(func) or '') except ValueError as e: print('*' * 78) print("ERROR: '%s' while parsing `%s`" % (e, self._f)) print('*' * 78) if not self['Signature']: func, func_name = self.get_func() try: # try to read signature argspec = inspect.getargspec(func) argspec = inspect.formatargspec(*argspec) argspec = argspec.replace('*', '\*') signature = '%s%s' % (func_name, argspec) except TypeError as e: signature = '%s()' % func_name self['Signature'] = signature def get_func(self): func_name = getattr(self._f, '__name__', self.__class__.__name__) if inspect.isclass(self._f): func = getattr(self._f, '__call__', self._f.__init__) else: func = self._f return func, func_name def __str__(self): out = '' func, func_name = self.get_func() signature = self['Signature'].replace('*', '\*') roles = {'func': 'function', 'meth': 'method'} if self._role: if not roles.has_key(self._role): print("Warning: invalid role %s" % self._role) out += '.. %s:: %s\n \n\n' % (roles.get(self._role, ''), func_name) out += super(FunctionDoc, self).__str__(func_role=self._role) return out class ClassDoc(NumpyDocString): def __init__(self, cls, modulename='', func_doc=FunctionDoc): if not inspect.isclass(cls): raise ValueError("Initialise using a class. Got %r" % cls) self._cls = cls if modulename and not modulename.endswith('.'): modulename += '.' self._mod = modulename self._name = cls.__name__ self._func_doc = func_doc NumpyDocString.__init__(self, pydoc.getdoc(cls)) @property def methods(self): return [name for name, func in inspect.getmembers(self._cls) if not name.startswith('_') and callable(func)] def __str__(self): out = '' out += super(ClassDoc, self).__str__() out += "\n\n" #for m in self.methods: # print "Parsing `%s`" % m # out += str(self._func_doc(getattr(self._cls,m), 'meth')) + '\n\n' # out += '.. index::\n single: %s; %s\n\n' % (self._name, m) return out joblib-0.9.4/doc/sphinxext/docscrape_sphinx.py000066400000000000000000000104521264716474700215160ustar00rootroot00000000000000import inspect import textwrap import pydoc import sys if sys.version_info[0] == 2: from docscrape import NumpyDocString from docscrape import FunctionDoc from docscrape import ClassDoc else: from .docscrape import NumpyDocString from .docscrape import FunctionDoc from .docscrape import ClassDoc class SphinxDocString(NumpyDocString): # string conversion routines def _str_header(self, name, symbol='`'): return ['.. 
rubric:: ' + name, ''] def _str_field_list(self, name): return [':' + name + ':'] def _str_indent(self, doc, indent=4): out = [] for line in doc: out += [' ' * indent + line] return out def _str_signature(self): return [''] if self['Signature']: return ['``%s``' % self['Signature']] + [''] else: return [''] def _str_summary(self): return self['Summary'] + [''] def _str_extended_summary(self): return self['Extended Summary'] + [''] def _str_param_list(self, name): out = [] if self[name]: out += self._str_field_list(name) out += [''] for param, param_type, desc in self[name]: out += self._str_indent(['**%s** : %s' % (param.strip(), param_type)]) out += [''] out += self._str_indent(desc, 8) out += [''] return out def _str_section(self, name): out = [] if self[name]: out += self._str_header(name) out += [''] content = textwrap.dedent("\n".join(self[name])).split("\n") out += content out += [''] return out def _str_see_also(self, func_role): out = [] if self['See Also']: see_also = super(SphinxDocString, self)._str_see_also(func_role) out = ['.. seealso::', ''] out += self._str_indent(see_also[2:]) return out def _str_warnings(self): out = [] if self['Warnings']: out = ['.. warning::', ''] out += self._str_indent(self['Warnings']) return out def _str_index(self): idx = self['index'] out = [] if len(idx) == 0: return out out += ['.. index:: %s' % idx.get('default', '')] for section, references in idx.iteritems(): if section == 'default': continue elif section == 'refguide': out += [' single: %s' % (', '.join(references))] else: out += [' %s: %s' % (section, ','.join(references))] return out def _str_references(self): out = [] if self['References']: out += self._str_header('References') if isinstance(self['References'], str): self['References'] = [self['References']] out.extend(self['References']) out += [''] return out def __str__(self, indent=0, func_role="obj"): out = [] out += self._str_signature() out += self._str_index() + [''] out += self._str_summary() out += self._str_extended_summary() for param_list in ('Parameters', 'Attributes', 'Methods', 'Returns', 'Raises'): out += self._str_param_list(param_list) out += self._str_warnings() out += self._str_see_also(func_role) out += self._str_section('Notes') out += self._str_references() out += self._str_section('Examples') out = self._str_indent(out, indent) return '\n'.join(out) class SphinxFunctionDoc(SphinxDocString, FunctionDoc): pass class SphinxClassDoc(SphinxDocString, ClassDoc): pass def get_doc_object(obj, what=None): if what is None: if inspect.isclass(obj): what = 'class' elif inspect.ismodule(obj): what = 'module' elif callable(obj): what = 'function' else: what = 'object' if what == 'class': return SphinxClassDoc(obj, '', func_doc=SphinxFunctionDoc) elif what in ('function', 'method'): return SphinxFunctionDoc(obj, '') else: return SphinxDocString(pydoc.getdoc(obj)) joblib-0.9.4/doc/sphinxext/numpydoc.py000066400000000000000000000077111264716474700200240ustar00rootroot00000000000000""" ======== numpydoc ======== Sphinx extension that handles docstrings in the Numpy standard format. [1] It will: - Convert Parameters etc. sections to field lists. - Convert See Also section to a See also entry. - Renumber references. - Extract the signature from the docstring, if it can't be determined otherwise. .. 
[1] http://projects.scipy.org/scipy/numpy/wiki/CodingStyleGuidelines#docstring-standard """ import re import pydoc import inspect import sys if sys.version_info[0] == 2: from docscrape_sphinx import get_doc_object from docscrape_sphinx import SphinxDocString else: from .docscrape_sphinx import get_doc_object from .docscrape_sphinx import SphinxDocString def mangle_docstrings(app, what, name, obj, options, lines, reference_offset=[0]): if what == 'module': # Strip top title title_re = re.compile(r'^\s*[#*=]{4,}\n[a-z0-9 -]+\n[#*=]{4,}\s*', re.I | re.S) lines[:] = title_re.sub('', "\n".join(lines)).split("\n") else: doc = get_doc_object(obj, what) lines[:] = str(doc).split("\n") if app.config.numpydoc_edit_link and hasattr(obj, '__name__') and \ obj.__name__: v = dict(full_name=obj.__name__) lines += [''] + (app.config.numpydoc_edit_link % v).split("\n") # replace reference numbers so that there are no duplicates references = [] for l in lines: l = l.strip() if l.startswith('.. ['): try: references.append(int(l[len('.. ['):l.index(']')])) except ValueError: print("WARNING: invalid reference in %s docstring" % name) # Start renaming from the biggest number, otherwise we may # overwrite references. references.sort() if references: for i, line in enumerate(lines): for r in references: new_r = reference_offset[0] + r lines[i] = lines[i].replace('[%d]_' % r, '[%d]_' % new_r) lines[i] = lines[i].replace('.. [%d]' % r, '.. [%d]' % new_r) reference_offset[0] += len(references) def mangle_signature(app, what, name, obj, options, sig, retann): # Do not try to inspect classes that don't define `__init__` if (inspect.isclass(obj) and 'initializes x; see ' in pydoc.getdoc(obj.__init__)): return '', '' if not (callable(obj) or hasattr(obj, '__argspec_is_invalid_')): return if not hasattr(obj, '__doc__'): return doc = SphinxDocString(pydoc.getdoc(obj)) if doc['Signature']: sig = re.sub("^[^(]*", "", doc['Signature']) return sig, '' def initialize(app): try: app.connect('autodoc-process-signature', mangle_signature) except: monkeypatch_sphinx_ext_autodoc() def setup(app, get_doc_object_=get_doc_object): global get_doc_object get_doc_object = get_doc_object_ app.connect('autodoc-process-docstring', mangle_docstrings) app.connect('builder-inited', initialize) app.add_config_value('numpydoc_edit_link', None, True) #------------------------------------------------------------------------------ # Monkeypatch sphinx.ext.autodoc to accept argspecless autodocs (Sphinx < 0.5) #------------------------------------------------------------------------------ def monkeypatch_sphinx_ext_autodoc(): global _original_format_signature import sphinx.ext.autodoc if sphinx.ext.autodoc.format_signature is our_format_signature: return print("[numpydoc] Monkeypatching sphinx.ext.autodoc ...") _original_format_signature = sphinx.ext.autodoc.format_signature sphinx.ext.autodoc.format_signature = our_format_signature def our_format_signature(what, obj): r = mangle_signature(None, what, None, obj, None, None, None) if r is not None: return r[0] else: return _original_format_signature(what, obj) joblib-0.9.4/doc/sphinxext/phantom_import.py000066400000000000000000000133061264716474700212230ustar00rootroot00000000000000""" ============== phantom_import ============== Sphinx extension to make directives from ``sphinx.ext.autodoc`` and similar extensions to use docstrings loaded from an XML file. This extension loads an XML file in the Pydocweb format [1] and creates a dummy module that contains the specified docstrings. 
This can be used to get the current docstrings from a Pydocweb instance without needing to rebuild the documented module. .. [1] http://code.google.com/p/pydocweb """ import imp import sys import os import inspect import re def setup(app): app.connect('builder-inited', initialize) app.add_config_value('phantom_import_file', None, True) def initialize(app): fn = app.config.phantom_import_file if (fn and os.path.isfile(fn)): print("[numpydoc] Phantom importing modules from %s ..." % fn) import_phantom_module(fn) #------------------------------------------------------------------------------ # Creating 'phantom' modules from an XML description #------------------------------------------------------------------------------ def import_phantom_module(xml_file): """ Insert a fake Python module to sys.modules, based on a XML file. The XML file is expected to conform to Pydocweb DTD. The fake module will contain dummy objects, which guarantee the following: - Docstrings are correct. - Class inheritance relationships are correct (if present in XML). - Function argspec is *NOT* correct (even if present in XML). Instead, the function signature is prepended to the function docstring. - Class attributes are *NOT* correct; instead, they are dummy objects. Parameters ---------- xml_file : str Name of an XML file to read """ import lxml.etree as etree object_cache = {} tree = etree.parse(xml_file) root = tree.getroot() # Sort items so that # - Base classes come before classes inherited from them # - Modules come before their contents all_nodes = dict([(n.attrib['id'], n) for n in root]) def _get_bases(node, recurse=False): bases = [x.attrib['ref'] for x in node.findall('base')] if recurse: j = 0 while True: try: b = bases[j] except IndexError: break if b in all_nodes: bases.extend(_get_bases(all_nodes[b])) j += 1 return bases type_index = ['module', 'class', 'callable', 'object'] def base_cmp(a, b): x = cmp(type_index.index(a.tag), type_index.index(b.tag)) if x != 0: return x if a.tag == 'class' and b.tag == 'class': a_bases = _get_bases(a, recurse=True) b_bases = _get_bases(b, recurse=True) x = cmp(len(a_bases), len(b_bases)) if x != 0: return x if a.attrib['id'] in b_bases: return -1 if b.attrib['id'] in a_bases: return 1 return cmp(a.attrib['id'].count('.'), b.attrib['id'].count('.')) nodes = root.getchildren() nodes.sort(base_cmp) # Create phantom items for node in nodes: name = node.attrib['id'] doc = (node.text or '').decode('string-escape') + "\n" if doc == "\n": doc = "" # create parent, if missing parent = name while True: parent = '.'.join(parent.split('.')[:-1]) if not parent: break if parent in object_cache: break obj = imp.new_module(parent) object_cache[parent] = obj sys.modules[parent] = obj # create object if node.tag == 'module': obj = imp.new_module(name) obj.__doc__ = doc sys.modules[name] = obj elif node.tag == 'class': bases = [object_cache[b] for b in _get_bases(node) if b in object_cache] bases.append(object) init = lambda self: None init.__doc__ = doc obj = type(name, tuple(bases), {'__doc__': doc, '__init__': init}) obj.__name__ = name.split('.')[-1] elif node.tag == 'callable': funcname = node.attrib['id'].split('.')[-1] argspec = node.attrib.get('argspec') if argspec: argspec = re.sub('^[^(]*', '', argspec) doc = "%s%s\n\n%s" % (funcname, argspec, doc) obj = lambda: 0 obj.__argspec_is_invalid_ = True obj.func_name = funcname obj.__name__ = name obj.__doc__ = doc if inspect.isclass(object_cache[parent]): obj.__objclass__ = object_cache[parent] else: class Dummy(object): pass obj = 
Dummy() obj.__name__ = name obj.__doc__ = doc if inspect.isclass(object_cache[parent]): obj.__get__ = lambda: None object_cache[name] = obj if parent: if inspect.ismodule(object_cache[parent]): obj.__module__ = parent setattr(object_cache[parent], name.split('.')[-1], obj) # Populate items for node in root: obj = object_cache.get(node.attrib['id']) if obj is None: continue for ref in node.findall('ref'): if node.tag == 'class': if ref.attrib['ref'].startswith(node.attrib['id'] + '.'): setattr(obj, ref.attrib['name'], object_cache.get(ref.attrib['ref'])) else: setattr(obj, ref.attrib['name'], object_cache.get(ref.attrib['ref'])) joblib-0.9.4/doc/why.rst000066400000000000000000000032671264716474700151250ustar00rootroot00000000000000 Why joblib: project goals =========================== What pipelines bring us -------------------------- Pipeline processing systems can provide a set of useful features: Data-flow programming for performance ...................................... * **On-demand computing:** in pipeline systems such as labView, or VTK calculations are performed as needed by the outputs and only when inputs change. * **Transparent parallelization:** a pipeline topology can be inspected to deduce which operations can be run in parallel (it is equivalent to purely functional programming). Provenance tracking for understanding the code ............................................... * **Tracking of data and computations:** to be able to fully reproduce a computational experiment: requires tracking of the data and operation implemented. * **Inspecting data flow:** Inspecting intermediate results helps debugging and understanding. .. topic:: But pipeline frameworks can get in the way :class: warning We want our code to look like the underlying algorithm, not like a software framework. Joblib's approach -------------------- Functions are the simplest abstraction used by everyone. Our pipeline jobs (or tasks) are made of decorated functions. Tracking of parameters in a meaningful way requires specification of data model. We give up on that and use hashing for performance and robustness. Design choices --------------- * No dependencies other than Python * Robust, well-tested code, at the cost of functionality * Fast and suitable for scientific computing on big dataset without changing the original code * Only local imports: **embed joblib in your code by copying it** joblib-0.9.4/examples/000077500000000000000000000000001264716474700146255ustar00rootroot00000000000000joblib-0.9.4/examples/parallel_memmap.py000066400000000000000000000060631264716474700203340ustar00rootroot00000000000000"""Demonstrate the usage of numpy.memmap with joblib.Parallel This example shows how to preallocate data in memmap arrays both for input and output of the parallel worker processes. 
Sample output for this program:: [Worker 93486] Sum for row 0 is -1599.756454 [Worker 93487] Sum for row 1 is -243.253165 [Worker 93488] Sum for row 3 is 610.201883 [Worker 93489] Sum for row 2 is 187.982005 [Worker 93489] Sum for row 7 is 326.381617 [Worker 93486] Sum for row 4 is 137.324438 [Worker 93489] Sum for row 8 is -198.225809 [Worker 93487] Sum for row 5 is -1062.852066 [Worker 93488] Sum for row 6 is 1666.334107 [Worker 93486] Sum for row 9 is -463.711714 Expected sums computed in the parent process: [-1599.75645426 -243.25316471 187.98200458 610.20188337 137.32443803 -1062.85206633 1666.33410715 326.38161713 -198.22580876 -463.71171369] Actual sums computed by the worker processes: [-1599.75645426 -243.25316471 187.98200458 610.20188337 137.32443803 -1062.85206633 1666.33410715 326.38161713 -198.22580876 -463.71171369] """ import tempfile import shutil import os import numpy as np from joblib import Parallel, delayed from joblib import load, dump def sum_row(input, output, i): """Compute the sum of a row in input and store it in output""" sum_ = input[i, :].sum() print("[Worker %d] Sum for row %d is %f" % (os.getpid(), i, sum_)) output[i] = sum_ if __name__ == "__main__": rng = np.random.RandomState(42) folder = tempfile.mkdtemp() samples_name = os.path.join(folder, 'samples') sums_name = os.path.join(folder, 'sums') try: # Generate some data and an allocate an output buffer samples = rng.normal(size=(10, int(1e6))) # Pre-allocate a writeable shared memory map as a container for the # results of the parallel computation sums = np.memmap(sums_name, dtype=samples.dtype, shape=samples.shape[0], mode='w+') # Dump the input data to disk to free the memory dump(samples, samples_name) # Release the reference on the original in memory array and replace it # by a reference to the memmap array so that the garbage collector can # release the memory before forking. gc.collect() is internally called # in Parallel just before forking. samples = load(samples_name, mmap_mode='r') # Fork the worker processes to perform computation concurrently Parallel(n_jobs=4)(delayed(sum_row)(samples, sums, i) for i in range(samples.shape[0])) # Compare the results from the output buffer with the ground truth print("Expected sums computed in the parent process:") expected_result = samples.sum(axis=1) print(expected_result) print("Actual sums computed by the worker processes:") print(sums) assert np.allclose(expected_result, sums) finally: try: shutil.rmtree(folder) except: print("Failed to delete: " + folder) joblib-0.9.4/joblib/000077500000000000000000000000001264716474700142505ustar00rootroot00000000000000joblib-0.9.4/joblib/__init__.py000066400000000000000000000111611264716474700163610ustar00rootroot00000000000000""" Joblib is a set of tools to provide **lightweight pipelining in Python**. In particular, joblib offers: 1. transparent disk-caching of the output values and lazy re-evaluation (memoize pattern) 2. easy simple parallel computing 3. logging and tracing of the execution Joblib is optimized to be **fast** and **robust** in particular on large data and has specific optimizations for `numpy` arrays. It is **BSD-licensed**. 
============================== ============================================ **User documentation**: http://pythonhosted.org/joblib **Download packages**: http://pypi.python.org/pypi/joblib#downloads **Source code**: http://github.com/joblib/joblib **Report issues**: http://github.com/joblib/joblib/issues ============================== ============================================ Vision -------- The vision is to provide tools to easily achieve better performance and reproducibility when working with long running jobs. * **Avoid computing twice the same thing**: code is rerun over an over, for instance when prototyping computational-heavy jobs (as in scientific development), but hand-crafted solution to alleviate this issue is error-prone and often leads to unreproducible results * **Persist to disk transparently**: persisting in an efficient way arbitrary objects containing large data is hard. Using joblib's caching mechanism avoids hand-written persistence and implicitly links the file on disk to the execution context of the original Python object. As a result, joblib's persistence is good for resuming an application status or computational job, eg after a crash. Joblib strives to address these problems while **leaving your code and your flow control as unmodified as possible** (no framework, no new paradigms). Main features ------------------ 1) **Transparent and fast disk-caching of output value:** a memoize or make-like functionality for Python functions that works well for arbitrary Python objects, including very large numpy arrays. Separate persistence and flow-execution logic from domain logic or algorithmic code by writing the operations as a set of steps with well-defined inputs and outputs: Python functions. Joblib can save their computation to disk and rerun it only if necessary:: >>> from joblib import Memory >>> mem = Memory(cachedir='/tmp/joblib') >>> import numpy as np >>> a = np.vander(np.arange(3)).astype(np.float) >>> square = mem.cache(np.square) >>> b = square(a) # doctest: +ELLIPSIS ________________________________________________________________________________ [Memory] Calling square... square(array([[ 0., 0., 1.], [ 1., 1., 1.], [ 4., 2., 1.]])) ___________________________________________________________square - 0...s, 0.0min >>> c = square(a) >>> # The above call did not trigger an evaluation 2) **Embarrassingly parallel helper:** to make it easy to write readable parallel code and debug it quickly:: >>> from joblib import Parallel, delayed >>> from math import sqrt >>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0] 3) **Logging/tracing:** The different functionalities will progressively acquire better logging mechanism to help track what has been ran, and capture I/O easily. In addition, Joblib will provide a few I/O primitives, to easily define logging and display streams, and provide a way of compiling a report. We want to be able to quickly inspect what has been run. 4) **Fast compressed Persistence**: a replacement for pickle to work efficiently on Python objects containing large data ( *joblib.dump* & *joblib.load* ). .. 
>>> import shutil ; shutil.rmtree('/tmp/joblib/') """ # PEP0440 compatible formatted version, see: # https://www.python.org/dev/peps/pep-0440/ # # Generic release markers: # X.Y # X.Y.Z # For bugfix releases # # Admissible pre-release markers: # X.YaN # Alpha release # X.YbN # Beta release # X.YrcN # Release Candidate # X.Y # Final release # # Dev branch marker is: 'X.Y.dev' or 'X.Y.devN' where N is an integer. # 'X.Y.dev0' is the canonical version of 'X.Y.dev' # __version__ = '0.9.4' from .memory import Memory, MemorizedResult from .logger import PrintTime from .logger import Logger from .hashing import hash from .numpy_pickle import dump from .numpy_pickle import load from .parallel import Parallel from .parallel import delayed from .parallel import cpu_count joblib-0.9.4/joblib/_compat.py000066400000000000000000000004151264716474700162440ustar00rootroot00000000000000""" Compatibility layer for Python 3/Python 2 single codebase """ import sys PY3_OR_LATER = sys.version_info[0] >= 3 try: _basestring = basestring _bytes_or_unicode = (str, unicode) except NameError: _basestring = str _bytes_or_unicode = (bytes, str) joblib-0.9.4/joblib/_memory_helpers.py000066400000000000000000000070251264716474700200170ustar00rootroot00000000000000try: # Available in Python 3 from tokenize import open as open_py_source except ImportError: # Copied from python3 tokenize from codecs import lookup, BOM_UTF8 import re from io import TextIOWrapper, open cookie_re = re.compile("coding[:=]\s*([-\w.]+)") def _get_normal_name(orig_enc): """Imitates get_normal_name in tokenizer.c.""" # Only care about the first 12 characters. enc = orig_enc[:12].lower().replace("_", "-") if enc == "utf-8" or enc.startswith("utf-8-"): return "utf-8" if enc in ("latin-1", "iso-8859-1", "iso-latin-1") or \ enc.startswith(("latin-1-", "iso-8859-1-", "iso-latin-1-")): return "iso-8859-1" return orig_enc def _detect_encoding(readline): """ The detect_encoding() function is used to detect the encoding that should be used to decode a Python source file. It requires one argment, readline, in the same way as the tokenize() generator. It will call readline a maximum of twice, and return the encoding used (as a string) and a list of any lines (left as bytes) it has read in. It detects the encoding from the presence of a utf-8 bom or an encoding cookie as specified in pep-0263. If both a bom and a cookie are present, but disagree, a SyntaxError will be raised. If the encoding cookie is an invalid charset, raise a SyntaxError. Note that if a utf-8 bom is found, 'utf-8-sig' is returned. If no encoding is specified, then the default of 'utf-8' will be returned. 
""" bom_found = False encoding = None default = 'utf-8' def read_or_stop(): try: return readline() except StopIteration: return b'' def find_cookie(line): try: line_string = line.decode('ascii') except UnicodeDecodeError: return None matches = cookie_re.findall(line_string) if not matches: return None encoding = _get_normal_name(matches[0]) try: codec = lookup(encoding) except LookupError: # This behaviour mimics the Python interpreter raise SyntaxError("unknown encoding: " + encoding) if bom_found: if codec.name != 'utf-8': # This behaviour mimics the Python interpreter raise SyntaxError('encoding problem: utf-8') encoding += '-sig' return encoding first = read_or_stop() if first.startswith(BOM_UTF8): bom_found = True first = first[3:] default = 'utf-8-sig' if not first: return default, [] encoding = find_cookie(first) if encoding: return encoding, [first] second = read_or_stop() if not second: return default, [first] encoding = find_cookie(second) if encoding: return encoding, [first, second] return default, [first, second] def open_py_source(filename): """Open a file in read only mode using the encoding detected by detect_encoding(). """ buffer = open(filename, 'rb') encoding, lines = _detect_encoding(buffer.readline) buffer.seek(0) text = TextIOWrapper(buffer, encoding, line_buffering=True) text.mode = 'r' return textjoblib-0.9.4/joblib/_multiprocessing_helpers.py000066400000000000000000000022761264716474700217410ustar00rootroot00000000000000"""Helper module to factorize the conditional multiprocessing import logic We use a distinct module to simplify import statements and avoid introducing circular dependencies (for instance for the assert_spawning name). """ import os import warnings # Obtain possible configuration from the environment, assuming 1 (on) # by default, upon 0 set to None. Should instructively fail if some non # 0/1 value is set. mp = int(os.environ.get('JOBLIB_MULTIPROCESSING', 1)) or None if mp: try: import multiprocessing as mp import multiprocessing.pool except ImportError: mp = None # 2nd stage: validate that locking is available on the system and # issue a warning if not if mp is not None: try: _sem = mp.Semaphore() del _sem # cleanup except (ImportError, OSError) as e: mp = None warnings.warn('%s. joblib will operate in serial mode' % (e,)) # 3rd stage: backward compat for the assert_spawning helper if mp is not None: try: # Python 3.4+ from multiprocessing.context import assert_spawning except ImportError: from multiprocessing.forking import assert_spawning else: assert_spawning = None joblib-0.9.4/joblib/disk.py000066400000000000000000000063201264716474700155550ustar00rootroot00000000000000""" Disk management utilities. """ # Authors: Gael Varoquaux # Lars Buitinck # Copyright (c) 2010 Gael Varoquaux # License: BSD Style, 3 clauses. import errno import os import shutil import sys import time def disk_used(path): """ Return the disk usage in a directory.""" size = 0 for file in os.listdir(path) + ['.']: stat = os.stat(os.path.join(path, file)) if hasattr(stat, 'st_blocks'): size += stat.st_blocks * 512 else: # on some platform st_blocks is not available (e.g., Windows) # approximate by rounding to next multiple of 512 size += (stat.st_size // 512 + 1) * 512 # We need to convert to int to avoid having longs on some systems (we # don't want longs to avoid problems we SQLite) return int(size / 1024.) def memstr_to_kbytes(text): """ Convert a memory text to it's value in kilobytes. 
""" kilo = 1024 units = dict(K=1, M=kilo, G=kilo ** 2) try: size = int(units[text[-1]] * float(text[:-1])) except (KeyError, ValueError): raise ValueError( "Invalid literal for size give: %s (type %s) should be " "alike '10G', '500M', '50K'." % (text, type(text)) ) return size def mkdirp(d): """Ensure directory d exists (like mkdir -p on Unix) No guarantee that the directory is writable. """ try: os.makedirs(d) except OSError as e: if e.errno != errno.EEXIST: raise # if a rmtree operation fails in rm_subdirs, wait for this much time (in secs), # then retry once. if it still fails, raise the exception RM_SUBDIRS_RETRY_TIME = 0.1 def rm_subdirs(path, onerror=None): """Remove all subdirectories in this path. The directory indicated by `path` is left in place, and its subdirectories are erased. If onerror is set, it is called to handle the error with arguments (func, path, exc_info) where func is os.listdir, os.remove, or os.rmdir; path is the argument to that function that caused it to fail; and exc_info is a tuple returned by sys.exc_info(). If onerror is None, an exception is raised. """ # NOTE this code is adapted from the one in shutil.rmtree, and is # just as fast names = [] try: names = os.listdir(path) except os.error as err: if onerror is not None: onerror(os.listdir, path, sys.exc_info()) else: raise for name in names: fullname = os.path.join(path, name) if os.path.isdir(fullname): if onerror is not None: shutil.rmtree(fullname, False, onerror) else: # allow the rmtree to fail once, wait and re-try. # if the error is raised again, fail err_count = 0 while True: try: shutil.rmtree(fullname, False, None) break except os.error: if err_count > 0: raise err_count += 1 time.sleep(RM_SUBDIRS_RETRY_TIME) joblib-0.9.4/joblib/format_stack.py000066400000000000000000000353131264716474700173040ustar00rootroot00000000000000""" Represent an exception with a lot of information. Provides 2 useful functions: format_exc: format an exception into a complete traceback, with full debugging instruction. format_outer_frames: format the current position in the stack call. Adapted from IPython's VerboseTB. 
""" # Authors: Gael Varoquaux < gael dot varoquaux at normalesup dot org > # Nathaniel Gray # Fernando Perez # Copyright: 2010, Gael Varoquaux # 2001-2004, Fernando Perez # 2001 Nathaniel Gray # License: BSD 3 clause import inspect import keyword import linecache import os import pydoc import sys import time import tokenize import traceback try: # Python 2 generate_tokens = tokenize.generate_tokens except AttributeError: # Python 3 generate_tokens = tokenize.tokenize INDENT = ' ' * 8 ############################################################################### # some internal-use functions def safe_repr(value): """Hopefully pretty robust repr equivalent.""" # this is pretty horrible but should always return *something* try: return pydoc.text.repr(value) except KeyboardInterrupt: raise except: try: return repr(value) except KeyboardInterrupt: raise except: try: # all still in an except block so we catch # getattr raising name = getattr(value, '__name__', None) if name: # ick, recursion return safe_repr(name) klass = getattr(value, '__class__', None) if klass: return '%s instance' % safe_repr(klass) except KeyboardInterrupt: raise except: return 'UNRECOVERABLE REPR FAILURE' def eq_repr(value, repr=safe_repr): return '=%s' % repr(value) ############################################################################### def uniq_stable(elems): """uniq_stable(elems) -> list Return from an iterable, a list of all the unique elements in the input, but maintaining the order in which they first appear. A naive solution to this problem which just makes a dictionary with the elements as keys fails to respect the stability condition, since dictionaries are unsorted by nature. Note: All elements in the input must be hashable. """ unique = [] unique_set = set() for nn in elems: if nn not in unique_set: unique.append(nn) unique_set.add(nn) return unique ############################################################################### def fix_frame_records_filenames(records): """Try to fix the filenames in each record from inspect.getinnerframes(). Particularly, modules loaded from within zip files have useless filenames attached to their code object, and inspect.getinnerframes() just uses it. """ fixed_records = [] for frame, filename, line_no, func_name, lines, index in records: # Look inside the frame's globals dictionary for __file__, which should # be better. better_fn = frame.f_globals.get('__file__', None) if isinstance(better_fn, str): # Check the type just in case someone did something weird with # __file__. It might also be None if the error occurred during # import. 
filename = better_fn fixed_records.append((frame, filename, line_no, func_name, lines, index)) return fixed_records def _fixed_getframes(etb, context=1, tb_offset=0): LNUM_POS, LINES_POS, INDEX_POS = 2, 4, 5 records = fix_frame_records_filenames(inspect.getinnerframes(etb, context)) # If the error is at the console, don't build any context, since it would # otherwise produce 5 blank lines printed out (there is no file at the # console) rec_check = records[tb_offset:] try: rname = rec_check[0][1] if rname == '' or rname.endswith(''): return rec_check except IndexError: pass aux = traceback.extract_tb(etb) assert len(records) == len(aux) for i, (file, lnum, _, _) in enumerate(aux): maybeStart = lnum - 1 - context // 2 start = max(maybeStart, 0) end = start + context lines = linecache.getlines(file)[start:end] # pad with empty lines if necessary if maybeStart < 0: lines = (['\n'] * -maybeStart) + lines if len(lines) < context: lines += ['\n'] * (context - len(lines)) buf = list(records[i]) buf[LNUM_POS] = lnum buf[INDEX_POS] = lnum - 1 - start buf[LINES_POS] = lines records[i] = tuple(buf) return records[tb_offset:] def _format_traceback_lines(lnum, index, lines, lvals=None): numbers_width = 7 res = [] i = lnum - index for line in lines: if i == lnum: # This is the line with the error pad = numbers_width - len(str(i)) if pad >= 3: marker = '-' * (pad - 3) + '-> ' elif pad == 2: marker = '> ' elif pad == 1: marker = '>' else: marker = '' num = marker + str(i) else: num = '%*s' % (numbers_width, i) line = '%s %s' % (num, line) res.append(line) if lvals and i == lnum: res.append(lvals + '\n') i = i + 1 return res def format_records(records): # , print_globals=False): # Loop over all records printing context and info frames = [] abspath = os.path.abspath for frame, file, lnum, func, lines, index in records: try: file = file and abspath(file) or '?' except OSError: # if file is '' or something not in the filesystem, # the abspath call will throw an OSError. Just ignore it and # keep the original file string. pass if file.endswith('.pyc'): file = file[:-4] + '.py' link = file args, varargs, varkw, locals = inspect.getargvalues(frame) if func == '?': call = '' else: # Decide whether to include variable details or not try: call = 'in %s%s' % (func, inspect.formatargvalues(args, varargs, varkw, locals, formatvalue=eq_repr)) except KeyError: # Very odd crash from inspect.formatargvalues(). The # scenario under which it appeared was a call to # view(array,scale) in NumTut.view.view(), where scale had # been defined as a scalar (it should be a tuple). Somehow # inspect messes up resolving the argument list of view() # and barfs out. At some point I should dig into this one # and file a bug report about it. print("\nJoblib's exception reporting continues...\n") call = 'in %s(***failed resolving arguments***)' % func # Initialize a list of names on the current line, which the # tokenizer below will populate. names = [] def tokeneater(token_type, token, start, end, line): """Stateful tokeneater which builds dotted names. The list of names it appends to (from the enclosing scope) can contain repeated composite names. This is unavoidable, since there is no way to disambiguate partial dotted structures until the full list is known. The caller is responsible for pruning the final list of duplicates before using it.""" # build composite names if token == '.': try: names[-1] += '.' 
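# Simplified standalone sketch (not the joblib code itself, Python 3 shown) of
# the idea behind the tokeneater above: use the tokenize module to collect the
# plain variable names appearing on one line of source. The real code also
# stitches dotted names such as x.y.z back together.
import io
import keyword
import tokenize

def _names_on_line(source_line):
    tokens = tokenize.generate_tokens(io.StringIO(source_line).readline)
    return [tok.string for tok in tokens
            if tok.type == tokenize.NAME and tok.string not in keyword.kwlist]

# _names_on_line("result = obj.method(arg)\n") -> ['result', 'obj', 'method', 'arg']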
# store state so the next token is added for x.y.z names tokeneater.name_cont = True return except IndexError: pass if token_type == tokenize.NAME and token not in keyword.kwlist: if tokeneater.name_cont: # Dotted names names[-1] += token tokeneater.name_cont = False else: # Regular new names. We append everything, the caller # will be responsible for pruning the list later. It's # very tricky to try to prune as we go, b/c composite # names can fool us. The pruning at the end is easy # to do (or the caller can print a list with repeated # names if so desired. names.append(token) elif token_type == tokenize.NEWLINE: raise IndexError # we need to store a bit of state in the tokenizer to build # dotted names tokeneater.name_cont = False def linereader(file=file, lnum=[lnum], getline=linecache.getline): line = getline(file, lnum[0]) lnum[0] += 1 return line # Build the list of names on this line of code where the exception # occurred. try: # This builds the names list in-place by capturing it from the # enclosing scope. for token in generate_tokens(linereader): tokeneater(*token) except (IndexError, UnicodeDecodeError): # signals exit of tokenizer pass except tokenize.TokenError as msg: _m = ("An unexpected error occurred while tokenizing input file %s\n" "The following traceback may be corrupted or invalid\n" "The error message is: %s\n" % (file, msg)) print(_m) # prune names list of duplicates, but keep the right order unique_names = uniq_stable(names) # Start loop over vars lvals = [] for name_full in unique_names: name_base = name_full.split('.', 1)[0] if name_base in frame.f_code.co_varnames: if name_base in locals.keys(): try: value = safe_repr(eval(name_full, locals)) except: value = "undefined" else: value = "undefined" name = name_full lvals.append('%s = %s' % (name, value)) #elif print_globals: # if frame.f_globals.has_key(name_base): # try: # value = safe_repr(eval(name_full,frame.f_globals)) # except: # value = "undefined" # else: # value = "undefined" # name = 'global %s' % name_full # lvals.append('%s = %s' % (name,value)) if lvals: lvals = '%s%s' % (INDENT, ('\n%s' % INDENT).join(lvals)) else: lvals = '' level = '%s\n%s %s\n' % (75 * '.', link, call) if index is None: frames.append(level) else: frames.append('%s%s' % (level, ''.join( _format_traceback_lines(lnum, index, lines, lvals)))) return frames ############################################################################### def format_exc(etype, evalue, etb, context=5, tb_offset=0): """ Return a nice text document describing the traceback. Parameters ----------- etype, evalue, etb: as returned by sys.exc_info context: number of lines of the source file to plot tb_offset: the number of stack frame not to use (0 = use all) """ # some locals try: etype = etype.__name__ except AttributeError: pass # Header with the exception type, python version, and date pyver = 'Python ' + sys.version.split()[0] + ': ' + sys.executable date = time.ctime(time.time()) pid = 'PID: %i' % os.getpid() head = '%s%s%s\n%s%s%s' % ( etype, ' ' * (75 - len(str(etype)) - len(date)), date, pid, ' ' * (75 - len(str(pid)) - len(pyver)), pyver) # Drop topmost frames if requested try: records = _fixed_getframes(etb, context, tb_offset) except: raise print('\nUnfortunately, your original traceback can not be ' 'constructed.\n') return '' # Get (safely) a string form of the exception info try: etype_str, evalue_str = map(str, (etype, evalue)) except: # User exception is improperly defined. 
etype, evalue = str, sys.exc_info()[:2] etype_str, evalue_str = map(str, (etype, evalue)) # ... and format it exception = ['%s: %s' % (etype_str, evalue_str)] frames = format_records(records) return '%s\n%s\n%s' % (head, '\n'.join(frames), ''.join(exception[0])) ############################################################################### def format_outer_frames(context=5, stack_start=None, stack_end=None, ignore_ipython=True): LNUM_POS, LINES_POS, INDEX_POS = 2, 4, 5 records = inspect.getouterframes(inspect.currentframe()) output = list() for i, (frame, filename, line_no, func_name, lines, index) \ in enumerate(records): # Look inside the frame's globals dictionary for __file__, which should # be better. better_fn = frame.f_globals.get('__file__', None) if isinstance(better_fn, str): # Check the type just in case someone did something weird with # __file__. It might also be None if the error occurred during # import. filename = better_fn if filename.endswith('.pyc'): filename = filename[:-4] + '.py' if ignore_ipython: # Hack to avoid printing the internals of IPython if (os.path.basename(filename) == 'iplib.py' and func_name in ('safe_execfile', 'runcode')): break maybeStart = line_no - 1 - context // 2 start = max(maybeStart, 0) end = start + context lines = linecache.getlines(filename)[start:end] # pad with empty lines if necessary if maybeStart < 0: lines = (['\n'] * -maybeStart) + lines if len(lines) < context: lines += ['\n'] * (context - len(lines)) buf = list(records[i]) buf[LNUM_POS] = line_no buf[INDEX_POS] = line_no - 1 - start buf[LINES_POS] = lines output.append(tuple(buf)) return '\n'.join(format_records(output[stack_end:stack_start:-1])) joblib-0.9.4/joblib/func_inspect.py000066400000000000000000000314701264716474700173070ustar00rootroot00000000000000""" My own variation on function-specific inspect-like features. """ # Author: Gael Varoquaux # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. from itertools import islice import inspect import warnings import re import os from ._compat import _basestring from .logger import pformat from ._memory_helpers import open_py_source from ._compat import PY3_OR_LATER def get_func_code(func): """ Attempts to retrieve a reliable function code hash. The reason we don't use inspect.getsource is that it caches the source, whereas we want this to be modified on the fly when the function is modified. Returns ------- func_code: string The function code source_file: string The path to the file in which the function is defined. first_line: int The first line of the code in the source file. Notes ------ This function does a bit more magic than inspect, and is thus more robust. """ source_file = None try: code = func.__code__ source_file = code.co_filename if not os.path.exists(source_file): # Use inspect for lambda functions and functions defined in an # interactive shell, or in doctests source_code = ''.join(inspect.getsourcelines(func)[0]) line_no = 1 if source_file.startswith('', source_file).groups() line_no = int(line_no) source_file = '' % source_file return source_code, source_file, line_no # Try to retrieve the source code. with open_py_source(source_file) as source_file_obj: first_line = code.co_firstlineno # All the lines after the function definition: source_lines = list(islice(source_file_obj, first_line - 1, None)) return ''.join(inspect.getblock(source_lines)), source_file, first_line except: # If the source code fails, we use the hash. This is fragile and # might change from one session to another. 
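# Illustrative sketch (not part of the joblib sources): get_func_code, defined
# in this module (joblib.func_inspect), returns the source text of a function,
# the file it was defined in and its first line number.
from joblib.func_inspect import get_func_code

def _double(x):
    return 2 * x

func_code, source_file, first_line = get_func_code(_double)
# func_code starts with "def _double(x):" when the source file can be read.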
if hasattr(func, '__code__'): # Python 3.X return str(func.__code__.__hash__()), source_file, -1 else: # Weird objects like numpy ufunc don't have __code__ # This is fragile, as quite often the id of the object is # in the repr, so it might not persist across sessions, # however it will work for ufuncs. return repr(func), source_file, -1 def _clean_win_chars(string): """Windows cannot encode some characters in filename.""" import urllib if hasattr(urllib, 'quote'): quote = urllib.quote else: # In Python 3, quote is elsewhere import urllib.parse quote = urllib.parse.quote for char in ('<', '>', '!', ':', '\\'): string = string.replace(char, quote(char)) return string def get_func_name(func, resolv_alias=True, win_characters=True): """ Return the function import path (as a list of module names), and a name for the function. Parameters ---------- func: callable The func to inspect resolv_alias: boolean, optional If true, possible local aliases are indicated. win_characters: boolean, optional If true, substitute special characters using urllib.quote This is useful in Windows, as it cannot encode some filenames """ if hasattr(func, '__module__'): module = func.__module__ else: try: module = inspect.getmodule(func) except TypeError: if hasattr(func, '__class__'): module = func.__class__.__module__ else: module = 'unknown' if module is None: # Happens in doctests, eg module = '' if module == '__main__': try: filename = os.path.abspath(inspect.getsourcefile(func)) except: filename = None if filename is not None: # mangling of full path to filename parts = filename.split(os.sep) if parts[-1].startswith(' 1500: arg = '%s...' % arg[:700] if previous_length > 80: arg = '\n%s' % arg previous_length = len(arg) arg_str.append(arg) arg_str.extend(['%s=%s' % (v, pformat(i)) for v, i in kwargs.items()]) arg_str = ', '.join(arg_str) signature = '%s(%s)' % (name, arg_str) return module_path, signature def format_call(func, args, kwargs, object_name="Memory"): """ Returns a nicely formatted statement displaying the function call with the given arguments. """ path, signature = format_signature(func, *args, **kwargs) msg = '%s\n[%s] Calling %s...\n%s' % (80 * '_', object_name, path, signature) return msg # XXX: Not using logging framework #self.debug(msg) joblib-0.9.4/joblib/hashing.py000066400000000000000000000235201264716474700162450ustar00rootroot00000000000000""" Fast cryptographic hash of Python objects, with a special case for fast hashing of numpy arrays. """ # Author: Gael Varoquaux # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. import pickle import hashlib import sys import types import struct import io from ._compat import _bytes_or_unicode, PY3_OR_LATER if PY3_OR_LATER: Pickler = pickle._Pickler else: Pickler = pickle.Pickler class _ConsistentSet(object): """ Class used to ensure the hash of Sets is preserved whatever the order of its items. """ def __init__(self, set_sequence): # Forces order of elements in set to ensure consistent hash. try: # Trying first to order the set assuming the type of elements is # consistent and orderable. # This fails on python 3 when elements are unorderable # but we keep it in a try as it's faster. self._sequence = sorted(set_sequence) except TypeError: # If elements are unorderable, sorting them using their hash. # This is slower but works in any case. 
self._sequence = sorted((hash(e) for e in set_sequence)) class _MyHash(object): """ Class used to hash objects that won't normally pickle """ def __init__(self, *args): self.args = args class Hasher(Pickler): """ A subclass of pickler, to do cryptographic hashing, rather than pickling. """ def __init__(self, hash_name='md5'): self.stream = io.BytesIO() # By default we want a pickle protocol that only changes with # the major python version and not the minor one protocol = (pickle.DEFAULT_PROTOCOL if PY3_OR_LATER else pickle.HIGHEST_PROTOCOL) Pickler.__init__(self, self.stream, protocol=protocol) # Initialise the hash obj self._hash = hashlib.new(hash_name) def hash(self, obj, return_digest=True): try: self.dump(obj) except pickle.PicklingError as e: e.args += ('PicklingError while hashing %r: %r' % (obj, e),) raise dumps = self.stream.getvalue() self._hash.update(dumps) if return_digest: return self._hash.hexdigest() def save(self, obj): if isinstance(obj, (types.MethodType, type({}.pop))): # the Pickler cannot pickle instance methods; here we decompose # them into components that make them uniquely identifiable if hasattr(obj, '__func__'): func_name = obj.__func__.__name__ else: func_name = obj.__name__ inst = obj.__self__ if type(inst) == type(pickle): obj = _MyHash(func_name, inst.__name__) elif inst is None: # type(None) or type(module) do not pickle obj = _MyHash(func_name, inst) else: cls = obj.__self__.__class__ obj = _MyHash(func_name, inst, cls) Pickler.save(self, obj) def memoize(self, obj): # We want hashing to be sensitive to value instead of reference. # For example we want ['aa', 'aa'] and ['aa', 'aaZ'[:2]] # to hash to the same value and that's why we disable memoization # for strings if isinstance(obj, _bytes_or_unicode): return Pickler.memoize(self, obj) # The dispatch table of the pickler is not accessible in Python # 3, as these lines are only bugware for IPython, we skip them. def save_global(self, obj, name=None, pack=struct.pack): # We have to override this method in order to deal with objects # defined interactively in IPython that are not injected in # __main__ kwargs = dict(name=name, pack=pack) if sys.version_info >= (3, 4): del kwargs['pack'] try: Pickler.save_global(self, obj, **kwargs) except pickle.PicklingError: Pickler.save_global(self, obj, **kwargs) module = getattr(obj, "__module__", None) if module == '__main__': my_name = name if my_name is None: my_name = obj.__name__ mod = sys.modules[module] if not hasattr(mod, my_name): # IPython doesn't inject the variables define # interactively in __main__ setattr(mod, my_name, obj) dispatch = Pickler.dispatch.copy() # builtin dispatch[type(len)] = save_global # type dispatch[type(object)] = save_global # classobj dispatch[type(Pickler)] = save_global # function dispatch[type(pickle.dump)] = save_global def _batch_setitems(self, items): # forces order of keys in dict to ensure consistent hash. try: # Trying first to compare dict assuming the type of keys is # consistent and orderable. # This fails on python 3 when keys are unorderable # but we keep it in a try as it's faster. Pickler._batch_setitems(self, iter(sorted(items))) except TypeError: # If keys are unorderable, sorting them using their hash. This is # slower but works in any case. 
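# Illustrative sketch (not part of the joblib sources): because set elements
# and dict items are sorted before being pickled (see _ConsistentSet and
# Hasher._batch_setitems in this module), joblib.hash returns the same digest
# whatever the insertion order of the keys.
import joblib

d1 = {'a': 1, 'b': 2, 'c': 3}
d2 = {'c': 3, 'b': 2, 'a': 1}
assert joblib.hash(d1) == joblib.hash(d2)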
Pickler._batch_setitems(self, iter(sorted((hash(k), v) for k, v in items))) def save_set(self, set_items): # forces order of items in Set to ensure consistent hash Pickler.save(self, _ConsistentSet(set_items)) dispatch[type(set())] = save_set class NumpyHasher(Hasher): """ Special case the hasher for when numpy is loaded. """ def __init__(self, hash_name='md5', coerce_mmap=False): """ Parameters ---------- hash_name: string The hash algorithm to be used coerce_mmap: boolean Make no difference between np.memmap and np.ndarray objects. """ self.coerce_mmap = coerce_mmap Hasher.__init__(self, hash_name=hash_name) # delayed import of numpy, to avoid tight coupling import numpy as np self.np = np if hasattr(np, 'getbuffer'): self._getbuffer = np.getbuffer else: self._getbuffer = memoryview def save(self, obj): """ Subclass the save method, to hash ndarray subclass, rather than pickling them. Off course, this is a total abuse of the Pickler class. """ if isinstance(obj, self.np.ndarray) and not obj.dtype.hasobject: # Compute a hash of the object: try: # memoryview is not supported for some dtypes, # e.g. datetime64, see # https://github.com/numpy/numpy/issues/4983. The # workaround is to view the array as bytes before # taking the memoryview obj_bytes_view = obj.view(self.np.uint8) self._hash.update(self._getbuffer(obj_bytes_view)) # ValueError is raised by .view when the array is not contiguous # BufferError is raised by Python 3 in the hash update if # the array is Fortran rather than C contiguous except (ValueError, BufferError): # Cater for non-single-segment arrays: this creates a # copy, and thus aleviates this issue. # XXX: There might be a more efficient way of doing this obj_bytes_view = obj.flatten().view(self.np.uint8) self._hash.update(self._getbuffer(obj_bytes_view)) # We store the class, to be able to distinguish between # Objects with the same binary content, but different # classes. if self.coerce_mmap and isinstance(obj, self.np.memmap): # We don't make the difference between memmap and # normal ndarrays, to be able to reload previously # computed results with memmap. klass = self.np.ndarray else: klass = obj.__class__ # We also return the dtype and the shape, to distinguish # different views on the same data with different dtypes. # The object will be pickled by the pickler hashed at the end. obj = (klass, ('HASHED', obj.dtype, obj.shape, obj.strides)) elif isinstance(obj, self.np.dtype): # Atomic dtype objects are interned by their default constructor: # np.dtype('f8') is np.dtype('f8') # This interning is not maintained by a # pickle.loads + pickle.dumps cycle, because __reduce__ # uses copy=True in the dtype constructor. This # non-deterministic behavior causes the internal memoizer # of the hasher to generate different hash values # depending on the history of the dtype object. # To prevent the hash from being sensitive to this, we use # .descr which is a full (and never interned) description of # the array dtype according to the numpy doc. klass = obj.__class__ obj = (klass, ('HASHED', obj.descr)) Hasher.save(self, obj) def hash(obj, hash_name='md5', coerce_mmap=False): """ Quick calculation of a hash to identify uniquely Python objects containing numpy arrays. Parameters ----------- hash_name: 'md5' or 'sha1' Hashing algorithm used. sha1 is supposedly safer, but md5 is faster. 
coerce_mmap: boolean Make no difference between np.memmap and np.ndarray """ if 'numpy' in sys.modules: hasher = NumpyHasher(hash_name=hash_name, coerce_mmap=coerce_mmap) else: hasher = Hasher(hash_name=hash_name) return hasher.hash(obj) joblib-0.9.4/joblib/logger.py000066400000000000000000000120171264716474700161020ustar00rootroot00000000000000""" Helpers for logging. This module needs much love to become useful. """ # Author: Gael Varoquaux # Copyright (c) 2008 Gael Varoquaux # License: BSD Style, 3 clauses. from __future__ import print_function import time import sys import os import shutil import logging import pprint from .disk import mkdirp def _squeeze_time(t): """Remove .1s to the time under Windows: this is the time it take to stat files. This is needed to make results similar to timings under Unix, for tests """ if sys.platform.startswith('win'): return max(0, t - .1) else: return t def format_time(t): t = _squeeze_time(t) return "%.1fs, %.1fmin" % (t, t / 60.) def short_format_time(t): t = _squeeze_time(t) if t > 60: return "%4.1fmin" % (t / 60.) else: return " %5.1fs" % (t) def pformat(obj, indent=0, depth=3): if 'numpy' in sys.modules: import numpy as np print_options = np.get_printoptions() np.set_printoptions(precision=6, threshold=64, edgeitems=1) else: print_options = None out = pprint.pformat(obj, depth=depth, indent=indent) if print_options: np.set_printoptions(**print_options) return out ############################################################################### # class `Logger` ############################################################################### class Logger(object): """ Base class for logging messages. """ def __init__(self, depth=3): """ Parameters ---------- depth: int, optional The depth of objects printed. """ self.depth = depth def warn(self, msg): logging.warn("[%s]: %s" % (self, msg)) def debug(self, msg): # XXX: This conflicts with the debug flag used in children class logging.debug("[%s]: %s" % (self, msg)) def format(self, obj, indent=0): """ Return the formated representation of the object. """ return pformat(obj, indent=indent, depth=self.depth) ############################################################################### # class `PrintTime` ############################################################################### class PrintTime(object): """ Print and log messages while keeping track of time. """ def __init__(self, logfile=None, logdir=None): if logfile is not None and logdir is not None: raise ValueError('Cannot specify both logfile and logdir') # XXX: Need argument docstring self.last_time = time.time() self.start_time = self.last_time if logdir is not None: logfile = os.path.join(logdir, 'joblib.log') self.logfile = logfile if logfile is not None: mkdirp(os.path.dirname(logfile)) if os.path.exists(logfile): # Rotate the logs for i in range(1, 9): try: shutil.move(logfile + '.%i' % i, logfile + '.%i' % (i + 1)) except: "No reason failing here" # Use a copy rather than a move, so that a process # monitoring this file does not get lost. try: shutil.copy(logfile, logfile + '.1') except: "No reason failing here" try: with open(logfile, 'w') as logfile: logfile.write('\nLogging joblib python script\n') logfile.write('\n---%s---\n' % time.ctime(self.last_time)) except: """ Multiprocessing writing to files can create race conditions. Rather fail silently than crash the computation. """ # XXX: We actually need a debug flag to disable this # silent failure. 
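# Illustrative sketch (not part of the joblib sources): the hash() helper
# defined above special-cases numpy arrays through NumpyHasher; equal arrays
# hash alike, while a different dtype or shape changes the digest. With
# coerce_mmap=True, np.memmap and np.ndarray are treated identically.
# Assumes numpy is installed.
import numpy as np
import joblib

a = np.arange(10)
assert joblib.hash(a) == joblib.hash(np.arange(10))
assert joblib.hash(a) != joblib.hash(a.astype(np.float64))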
def __call__(self, msg='', total=False): """ Print the time elapsed between the last call and the current call, with an optional message. """ if not total: time_lapse = time.time() - self.last_time full_msg = "%s: %s" % (msg, format_time(time_lapse)) else: # FIXME: Too much logic duplicated time_lapse = time.time() - self.start_time full_msg = "%s: %.2fs, %.1f min" % (msg, time_lapse, time_lapse / 60) print(full_msg, file=sys.stderr) if self.logfile is not None: try: with open(self.logfile, 'a') as f: print(full_msg, file=f) except: """ Multiprocessing writing to files can create race conditions. Rather fail silently than crash the calculation. """ # XXX: We actually need a debug flag to disable this # silent failure. self.last_time = time.time() joblib-0.9.4/joblib/memory.py000066400000000000000000001067541264716474700161470ustar00rootroot00000000000000""" A context object for caching a function's return value each time it is called with the same input arguments. """ # Author: Gael Varoquaux # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. from __future__ import with_statement import os import shutil import time import pydoc import re import sys try: import cPickle as pickle except ImportError: import pickle import functools import traceback import warnings import inspect import json import weakref import io # Local imports from . import hashing from .func_inspect import get_func_code, get_func_name, filter_args from .func_inspect import format_signature, format_call from ._memory_helpers import open_py_source from .logger import Logger, format_time, pformat from . import numpy_pickle from .disk import mkdirp, rm_subdirs from ._compat import _basestring, PY3_OR_LATER FIRST_LINE_TEXT = "# first line:" # TODO: The following object should have a data store object as a sub # object, and the interface to persist and query should be separated in # the data store. # # This would enable creating 'Memory' objects with a different logic for # pickling that would simply span a MemorizedFunc with the same # store (or do we want to copy it to avoid cross-talks?), for instance to # implement HDF5 pickling. # TODO: Same remark for the logger, and probably use the Python logging # mechanism. def extract_first_line(func_code): """ Extract the first line information from the function code text if available. """ if func_code.startswith(FIRST_LINE_TEXT): func_code = func_code.split('\n') first_line = int(func_code[0][len(FIRST_LINE_TEXT):]) func_code = '\n'.join(func_code[1:]) else: first_line = -1 return func_code, first_line class JobLibCollisionWarning(UserWarning): """ Warn that there might be a collision between names of functions. """ def _get_func_fullname(func): """Compute the part of part associated with a function. See code of_cache_key_to_dir() for details """ modules, funcname = get_func_name(func) modules.append(funcname) return os.path.join(*modules) def _cache_key_to_dir(cachedir, func, argument_hash): """Compute directory associated with a given cache key. func can be a function or a string as returned by _get_func_fullname(). 
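# Illustrative sketch (not part of the joblib sources): PrintTime, defined in
# joblib.logger above, reports the time elapsed since its previous call; with
# total=True it reports the time elapsed since it was created.
from joblib.logger import PrintTime

print_time = PrintTime()        # no logfile: messages go to stderr only
# ... some work here ...
print_time('first step done')
print_time('overall', total=True)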
""" parts = [cachedir] if isinstance(func, _basestring): parts.append(func) else: parts.append(_get_func_fullname(func)) if argument_hash is not None: parts.append(argument_hash) return os.path.join(*parts) def _load_output(output_dir, func_name, timestamp=None, metadata=None, mmap_mode=None, verbose=0): """Load output of a computation.""" if verbose > 1: signature = "" try: if metadata is not None: args = ", ".join(['%s=%s' % (name, value) for name, value in metadata['input_args'].items()]) signature = "%s(%s)" % (os.path.basename(func_name), args) else: signature = os.path.basename(func_name) except KeyError: pass if timestamp is not None: t = "% 16s" % format_time(time.time() - timestamp) else: t = "" if verbose < 10: print('[Memory]%s: Loading %s...' % (t, str(signature))) else: print('[Memory]%s: Loading %s from %s' % ( t, str(signature), output_dir)) filename = os.path.join(output_dir, 'output.pkl') if not os.path.isfile(filename): raise KeyError( "Non-existing cache value (may have been cleared).\n" "File %s does not exist" % filename) return numpy_pickle.load(filename, mmap_mode=mmap_mode) # An in-memory store to avoid looking at the disk-based function # source code to check if a function definition has changed _FUNCTION_HASHES = weakref.WeakKeyDictionary() ############################################################################### # class `MemorizedResult` ############################################################################### class MemorizedResult(Logger): """Object representing a cached value. Attributes ---------- cachedir: string path to root of joblib cache func: function or string function whose output is cached. The string case is intended only for instanciation based on the output of repr() on another instance. (namely eval(repr(memorized_instance)) works). argument_hash: string hash of the function arguments mmap_mode: {None, 'r+', 'r', 'w+', 'c'} The memmapping mode used when loading from cache numpy arrays. See numpy.load for the meaning of the different values. verbose: int verbosity level (0 means no message) timestamp, metadata: string for internal use only """ def __init__(self, cachedir, func, argument_hash, mmap_mode=None, verbose=0, timestamp=None, metadata=None): Logger.__init__(self) if isinstance(func, _basestring): self.func = func else: self.func = _get_func_fullname(func) self.argument_hash = argument_hash self.cachedir = cachedir self.mmap_mode = mmap_mode self._output_dir = _cache_key_to_dir(cachedir, self.func, argument_hash) if metadata is not None: self.metadata = metadata else: self.metadata = {} # No error is relevant here. 
try: with open(os.path.join(self._output_dir, 'metadata.json'), 'rb') as f: self.metadata = json.load(f) except: pass self.duration = self.metadata.get('duration', None) self.verbose = verbose self.timestamp = timestamp def get(self): """Read value from cache and return it.""" return _load_output(self._output_dir, _get_func_fullname(self.func), timestamp=self.timestamp, metadata=self.metadata, mmap_mode=self.mmap_mode, verbose=self.verbose) def clear(self): """Clear value from cache""" shutil.rmtree(self._output_dir, ignore_errors=True) def __repr__(self): return ('{class_name}(cachedir="{cachedir}", func="{func}", ' 'argument_hash="{argument_hash}")'.format( class_name=self.__class__.__name__, cachedir=self.cachedir, func=self.func, argument_hash=self.argument_hash )) def __reduce__(self): return (self.__class__, (self.cachedir, self.func, self.argument_hash), {'mmap_mode': self.mmap_mode}) class NotMemorizedResult(object): """Class representing an arbitrary value. This class is a replacement for MemorizedResult when there is no cache. """ __slots__ = ('value', 'valid') def __init__(self, value): self.value = value self.valid = True def get(self): if self.valid: return self.value else: raise KeyError("No value stored.") def clear(self): self.valid = False self.value = None def __repr__(self): if self.valid: return '{class_name}({value})'.format( class_name=self.__class__.__name__, value=pformat(self.value) ) else: return self.__class__.__name__ + ' with no value' # __getstate__ and __setstate__ are required because of __slots__ def __getstate__(self): return {"valid": self.valid, "value": self.value} def __setstate__(self, state): self.valid = state["valid"] self.value = state["value"] ############################################################################### # class `NotMemorizedFunc` ############################################################################### class NotMemorizedFunc(object): """No-op object decorating a function. This class replaces MemorizedFunc when there is no cache. It provides an identical API but does not write anything on disk. Attributes ---------- func: callable Original undecorated function. """ # Should be a light as possible (for speed) def __init__(self, func): self.func = func def __call__(self, *args, **kwargs): return self.func(*args, **kwargs) def call_and_shelve(self, *args, **kwargs): return NotMemorizedResult(self.func(*args, **kwargs)) def __reduce__(self): return (self.__class__, (self.func,)) def __repr__(self): return '%s(func=%s)' % ( self.__class__.__name__, self.func ) def clear(self, warn=True): # Argument "warn" is for compatibility with MemorizedFunc.clear pass ############################################################################### # class `MemorizedFunc` ############################################################################### class MemorizedFunc(Logger): """ Callable object decorating a function for caching its return value each time it is called. All values are cached on the filesystem, in a deep directory structure. Methods are provided to inspect the cache or clean it. Attributes ---------- func: callable The original, undecorated, function. cachedir: string Path to the base cache directory of the memory context. ignore: list or None List of variable names to ignore when choosing whether to recompute. mmap_mode: {None, 'r+', 'r', 'w+', 'c'} The memmapping mode used when loading from cache numpy arrays. See numpy.load for the meaning of the different values. 
compress: boolean, or integer Whether to zip the stored data on disk. If an integer is given, it should be between 1 and 9, and sets the amount of compression. Note that compressed arrays cannot be read by memmapping. verbose: int, optional The verbosity flag, controls messages that are issued as the function is evaluated. """ #------------------------------------------------------------------------- # Public interface #------------------------------------------------------------------------- def __init__(self, func, cachedir, ignore=None, mmap_mode=None, compress=False, verbose=1, timestamp=None): """ Parameters ---------- func: callable The function to decorate cachedir: string The path of the base directory to use as a data store ignore: list or None List of variable names to ignore. mmap_mode: {None, 'r+', 'r', 'w+', 'c'}, optional The memmapping mode used when loading from cache numpy arrays. See numpy.load for the meaning of the arguments. compress : boolean, or integer Whether to zip the stored data on disk. If an integer is given, it should be between 1 and 9, and sets the amount of compression. Note that compressed arrays cannot be read by memmapping. verbose: int, optional Verbosity flag, controls the debug messages that are issued as functions are evaluated. The higher, the more verbose timestamp: float, optional The reference time from which times in tracing messages are reported. """ Logger.__init__(self) self.mmap_mode = mmap_mode self.func = func if ignore is None: ignore = [] self.ignore = ignore self._verbose = verbose self.cachedir = cachedir self.compress = compress if compress and self.mmap_mode is not None: warnings.warn('Compressed results cannot be memmapped', stacklevel=2) if timestamp is None: timestamp = time.time() self.timestamp = timestamp mkdirp(self.cachedir) try: functools.update_wrapper(self, func) except: " Objects like ufunc don't like that " if inspect.isfunction(func): doc = pydoc.TextDoc().document(func) # Remove blank line doc = doc.replace('\n', '\n\n', 1) # Strip backspace-overprints for compatibility with autodoc doc = re.sub('\x08.', '', doc) else: # Pydoc does a poor job on other objects doc = func.__doc__ self.__doc__ = 'Memoized version of %s' % doc def _cached_call(self, args, kwargs): """Call wrapped function and cache result, or read cache if available. This function returns the wrapped function output and some metadata. 
Returns ------- output: value or tuple what is returned by wrapped function argument_hash: string hash of function arguments metadata: dict some metadata about wrapped function call (see _persist_input()) """ # Compare the function code with the previous to see if the # function code has changed output_dir, argument_hash = self._get_output_dir(*args, **kwargs) metadata = None # FIXME: The statements below should be try/excepted if not (self._check_previous_func_code(stacklevel=4) and os.path.exists(output_dir)): if self._verbose > 10: _, name = get_func_name(self.func) self.warn('Computing func %s, argument hash %s in ' 'directory %s' % (name, argument_hash, output_dir)) out, metadata = self.call(*args, **kwargs) if self.mmap_mode is not None: # Memmap the output at the first call to be consistent with # later calls out = _load_output(output_dir, _get_func_fullname(self.func), timestamp=self.timestamp, mmap_mode=self.mmap_mode, verbose=self._verbose) else: try: t0 = time.time() out = _load_output(output_dir, _get_func_fullname(self.func), timestamp=self.timestamp, metadata=metadata, mmap_mode=self.mmap_mode, verbose=self._verbose) if self._verbose > 4: t = time.time() - t0 _, name = get_func_name(self.func) msg = '%s cache loaded - %s' % (name, format_time(t)) print(max(0, (80 - len(msg))) * '_' + msg) except Exception: # XXX: Should use an exception logger self.warn('Exception while loading results for ' '(args=%s, kwargs=%s)\n %s' % (args, kwargs, traceback.format_exc())) shutil.rmtree(output_dir, ignore_errors=True) out, metadata = self.call(*args, **kwargs) argument_hash = None return (out, argument_hash, metadata) def call_and_shelve(self, *args, **kwargs): """Call wrapped function, cache result and return a reference. This method returns a reference to the cached result instead of the result itself. The reference object is small and pickeable, allowing to send or store it easily. Call .get() on reference object to get result. Returns ------- cached_result: MemorizedResult or NotMemorizedResult reference to the value returned by the wrapped function. The class "NotMemorizedResult" is used when there is no cache activated (e.g. cachedir=None in Memory). """ _, argument_hash, metadata = self._cached_call(args, kwargs) return MemorizedResult(self.cachedir, self.func, argument_hash, metadata=metadata, verbose=self._verbose - 1, timestamp=self.timestamp) def __call__(self, *args, **kwargs): return self._cached_call(args, kwargs)[0] def __reduce__(self): """ We don't store the timestamp when pickling, to avoid the hash depending from it. 
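# Illustrative sketch (not part of the joblib sources): call_and_shelve,
# defined above, returns a small MemorizedResult handle instead of the value;
# .get() loads the value back from the cache and .clear() removes this entry
# only. The cache directory is an example path.
from joblib import Memory

memory = Memory(cachedir='/tmp/joblib_example', verbose=0)

@memory.cache
def square(x):
    return x ** 2

result_ref = square.call_and_shelve(4)
assert result_ref.get() == 16
result_ref.clear()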
In addition, when unpickling, we run the __init__ """ return (self.__class__, (self.func, self.cachedir, self.ignore, self.mmap_mode, self.compress, self._verbose)) def format_signature(self, *args, **kwargs): warnings.warn("MemorizedFunc.format_signature will be removed in a " "future version of joblib.", DeprecationWarning) return format_signature(self.func, *args, **kwargs) def format_call(self, *args, **kwargs): warnings.warn("MemorizedFunc.format_call will be removed in a " "future version of joblib.", DeprecationWarning) return format_call(self.func, args, kwargs) #------------------------------------------------------------------------- # Private interface #------------------------------------------------------------------------- def _get_argument_hash(self, *args, **kwargs): return hashing.hash(filter_args(self.func, self.ignore, args, kwargs), coerce_mmap=(self.mmap_mode is not None)) def _get_output_dir(self, *args, **kwargs): """ Return the directory in which are persisted the result of the function called with the given arguments. """ argument_hash = self._get_argument_hash(*args, **kwargs) output_dir = os.path.join(self._get_func_dir(self.func), argument_hash) return output_dir, argument_hash get_output_dir = _get_output_dir # backward compatibility def _get_func_dir(self, mkdir=True): """ Get the directory corresponding to the cache for the function. """ func_dir = _cache_key_to_dir(self.cachedir, self.func, None) if mkdir: mkdirp(func_dir) return func_dir def _hash_func(self): """Hash a function to key the online cache""" func_code_h = hash(getattr(self.func, '__code__', None)) return id(self.func), hash(self.func), func_code_h def _write_func_code(self, filename, func_code, first_line): """ Write the function code and the filename to a file. """ # We store the first line because the filename and the function # name is not always enough to identify a function: people # sometimes have several functions named the same way in a # file. This is bad practice, but joblib should be robust to bad # practice. func_code = u'%s %i\n%s' % (FIRST_LINE_TEXT, first_line, func_code) with io.open(filename, 'w', encoding="UTF-8") as out: out.write(func_code) # Also store in the in-memory store of function hashes is_named_callable = False if PY3_OR_LATER: is_named_callable = (hasattr(self.func, '__name__') and self.func.__name__ != '') else: is_named_callable = (hasattr(self.func, 'func_name') and self.func.func_name != '') if is_named_callable: # Don't do this for lambda functions or strange callable # objects, as it ends up being too fragile func_hash = self._hash_func() try: _FUNCTION_HASHES[self.func] = func_hash except TypeError: # Some callable are not hashable pass def _check_previous_func_code(self, stacklevel=2): """ stacklevel is the depth a which this function is called, to issue useful warnings to the user. """ # First check if our function is in the in-memory store. # Using the in-memory store not only makes things faster, but it # also renders us robust to variations of the files when the # in-memory version of the code does not vary try: if self.func in _FUNCTION_HASHES: # We use as an identifier the id of the function and its # hash. This is more likely to falsely change than have hash # collisions, thus we are on the safe side. func_hash = self._hash_func() if func_hash == _FUNCTION_HASHES[self.func]: return True except TypeError: # Some callables are not hashable pass # Here, we go through some effort to be robust to dynamically # changing code and collision. 
We cannot inspect.getsource # because it is not reliable when using IPython's magic "%run". func_code, source_file, first_line = get_func_code(self.func) func_dir = self._get_func_dir() func_code_file = os.path.join(func_dir, 'func_code.py') try: with io.open(func_code_file, encoding="UTF-8") as infile: old_func_code, old_first_line = \ extract_first_line(infile.read()) except IOError: self._write_func_code(func_code_file, func_code, first_line) return False if old_func_code == func_code: return True # We have differing code, is this because we are referring to # different functions, or because the function we are referring to has # changed? _, func_name = get_func_name(self.func, resolv_alias=False, win_characters=False) if old_first_line == first_line == -1 or func_name == '': if not first_line == -1: func_description = '%s (%s:%i)' % (func_name, source_file, first_line) else: func_description = func_name warnings.warn(JobLibCollisionWarning( "Cannot detect name collisions for function '%s'" % func_description), stacklevel=stacklevel) # Fetch the code at the old location and compare it. If it is the # same than the code store, we have a collision: the code in the # file has not changed, but the name we have is pointing to a new # code block. if not old_first_line == first_line and source_file is not None: possible_collision = False if os.path.exists(source_file): _, func_name = get_func_name(self.func, resolv_alias=False) num_lines = len(func_code.split('\n')) with open_py_source(source_file) as f: on_disk_func_code = f.readlines()[ old_first_line - 1:old_first_line - 1 + num_lines - 1] on_disk_func_code = ''.join(on_disk_func_code) possible_collision = (on_disk_func_code.rstrip() == old_func_code.rstrip()) else: possible_collision = source_file.startswith(' 10: _, func_name = get_func_name(self.func, resolv_alias=False) self.warn("Function %s (stored in %s) has changed." % (func_name, func_dir)) self.clear(warn=True) return False def clear(self, warn=True): """ Empty the function's cache. """ func_dir = self._get_func_dir(mkdir=False) if self._verbose > 0 and warn: self.warn("Clearing cache %s" % func_dir) if os.path.exists(func_dir): shutil.rmtree(func_dir, ignore_errors=True) mkdirp(func_dir) func_code, _, first_line = get_func_code(self.func) func_code_file = os.path.join(func_dir, 'func_code.py') self._write_func_code(func_code_file, func_code, first_line) def call(self, *args, **kwargs): """ Force the execution of the function with the given arguments and persist the output values. """ start_time = time.time() output_dir, _ = self._get_output_dir(*args, **kwargs) if self._verbose > 0: print(format_call(self.func, args, kwargs)) output = self.func(*args, **kwargs) self._persist_output(output, output_dir) duration = time.time() - start_time metadata = self._persist_input(output_dir, duration, args, kwargs) if self._verbose > 0: _, name = get_func_name(self.func) msg = '%s - %s' % (name, format_time(duration)) print(max(0, (80 - len(msg))) * '_' + msg) return output, metadata # Make public def _persist_output(self, output, dir): """ Persist the given output tuple in the directory. 
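# Illustrative sketch (not part of the joblib sources): MemorizedFunc.call and
# MemorizedFunc.clear, both defined above. call() always re-runs the function
# and re-persists its output, clear() wipes every cached result of the
# function. The cache directory is an example path.
from joblib import Memory

memory = Memory(cachedir='/tmp/joblib_example', verbose=0)

@memory.cache
def cube(x):
    return x ** 3

output, metadata = cube.call(3)   # forced execution, bypassing the cache lookup
cube.clear(warn=False)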
""" try: mkdirp(dir) filename = os.path.join(dir, 'output.pkl') numpy_pickle.dump(output, filename, compress=self.compress) if self._verbose > 10: print('Persisting in %s' % dir) except OSError: " Race condition in the creation of the directory " def _persist_input(self, output_dir, duration, args, kwargs, this_duration_limit=0.5): """ Save a small summary of the call using json format in the output directory. output_dir: string directory where to write metadata. duration: float time taken by hashing input arguments, calling the wrapped function and persisting its output. args, kwargs: list and dict input arguments for wrapped function this_duration_limit: float Max execution time for this function before issuing a warning. """ start_time = time.time() argument_dict = filter_args(self.func, self.ignore, args, kwargs) input_repr = dict((k, repr(v)) for k, v in argument_dict.items()) # This can fail due to race-conditions with multiple # concurrent joblibs removing the file or the directory metadata = {"duration": duration, "input_args": input_repr} try: mkdirp(output_dir) with open(os.path.join(output_dir, 'metadata.json'), 'w') as f: json.dump(metadata, f) except: pass this_duration = time.time() - start_time if this_duration > this_duration_limit: # This persistence should be fast. It will not be if repr() takes # time and its output is large, because json.dump will have to # write a large file. This should not be an issue with numpy arrays # for which repr() always output a short representation, but can # be with complex dictionaries. Fixing the problem should be a # matter of replacing repr() above by something smarter. warnings.warn("Persisting input arguments took %.2fs to run.\n" "If this happens often in your code, it can cause " "performance problems \n" "(results will be correct in all cases). \n" "The reason for this is probably some large input " "arguments for a wrapped\n" " function (e.g. large strings).\n" "THIS IS A JOBLIB ISSUE. If you can, kindly provide " "the joblib's team with an\n" " example so that they can fix the problem." % this_duration, stacklevel=5) return metadata def load_output(self, output_dir): """ Read the results of a previous calculation from the directory it was cached in. """ warnings.warn("MemorizedFunc.load_output is deprecated and will be " "removed in a future version\n" "of joblib. A MemorizedResult provides similar features", DeprecationWarning) # No metadata available here. return _load_output(output_dir, _get_func_fullname(self.func), timestamp=self.timestamp, mmap_mode=self.mmap_mode, verbose=self._verbose) # XXX: Need a method to check if results are available. #------------------------------------------------------------------------- # Private `object` interface #------------------------------------------------------------------------- def __repr__(self): return '%s(func=%s, cachedir=%s)' % ( self.__class__.__name__, self.func, repr(self.cachedir), ) ############################################################################### # class `Memory` ############################################################################### class Memory(Logger): """ A context object for caching a function's return value each time it is called with the same input arguments. All values are cached on the filesystem, in a deep directory structure. 
see :ref:`memory_reference` """ #------------------------------------------------------------------------- # Public interface #------------------------------------------------------------------------- def __init__(self, cachedir, mmap_mode=None, compress=False, verbose=1): """ Parameters ---------- cachedir: string or None The path of the base directory to use as a data store or None. If None is given, no caching is done and the Memory object is completely transparent. mmap_mode: {None, 'r+', 'r', 'w+', 'c'}, optional The memmapping mode used when loading from cache numpy arrays. See numpy.load for the meaning of the arguments. compress: boolean, or integer Whether to zip the stored data on disk. If an integer is given, it should be between 1 and 9, and sets the amount of compression. Note that compressed arrays cannot be read by memmapping. verbose: int, optional Verbosity flag, controls the debug messages that are issued as functions are evaluated. """ # XXX: Bad explanation of the None value of cachedir Logger.__init__(self) self._verbose = verbose self.mmap_mode = mmap_mode self.timestamp = time.time() self.compress = compress if compress and mmap_mode is not None: warnings.warn('Compressed results cannot be memmapped', stacklevel=2) if cachedir is None: self.cachedir = None else: self.cachedir = os.path.join(cachedir, 'joblib') mkdirp(self.cachedir) def cache(self, func=None, ignore=None, verbose=None, mmap_mode=False): """ Decorates the given function func to only compute its return value for input arguments not cached on disk. Parameters ---------- func: callable, optional The function to be decorated ignore: list of strings A list of arguments name to ignore in the hashing verbose: integer, optional The verbosity mode of the function. By default that of the memory object is used. mmap_mode: {None, 'r+', 'r', 'w+', 'c'}, optional The memmapping mode used when loading from cache numpy arrays. See numpy.load for the meaning of the arguments. By default that of the memory object is used. Returns ------- decorated_func: MemorizedFunc object The returned object is a MemorizedFunc object, that is callable (behaves like a function), but offers extra methods for cache lookup and management. See the documentation for :class:`joblib.memory.MemorizedFunc`. """ if func is None: # Partial application, to be able to specify extra keyword # arguments in decorators return functools.partial(self.cache, ignore=ignore, verbose=verbose, mmap_mode=mmap_mode) if self.cachedir is None: return NotMemorizedFunc(func) if verbose is None: verbose = self._verbose if mmap_mode is False: mmap_mode = self.mmap_mode if isinstance(func, MemorizedFunc): func = func.func return MemorizedFunc(func, cachedir=self.cachedir, mmap_mode=mmap_mode, ignore=ignore, compress=self.compress, verbose=verbose, timestamp=self.timestamp) def clear(self, warn=True): """ Erase the complete cache directory. """ if warn: self.warn('Flushing completely the cache') if self.cachedir is not None: rm_subdirs(self.cachedir) def eval(self, func, *args, **kwargs): """ Eval function func with arguments `*args` and `**kwargs`, in the context of the memory. This method works similarly to the builtin `apply`, except that the function is called only if the cache is not up to date. 
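# Illustrative sketch (not part of the joblib sources): the public Memory API
# defined above. The cache directory is an example path; with cachedir=None the
# returned object is a transparent NotMemorizedFunc and nothing is written.
from joblib import Memory

memory = Memory(cachedir='/tmp/joblib_example', verbose=0)

def add(a, b):
    return a + b

cached_add = memory.cache(add)      # MemorizedFunc wrapping add
assert cached_add(1, 2) == 3        # computed and persisted
assert cached_add(1, 2) == 3        # read back from the cache
assert memory.eval(add, 4, 5) == 9  # cache a single call without decorating
memory.clear(warn=False)            # erase the whole cache directory

no_cache = Memory(cachedir=None).cache(add)
assert no_cache(1, 2) == 3          # NotMemorizedFunc: plain call, no caching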
""" if self.cachedir is None: return func(*args, **kwargs) return self.cache(func)(*args, **kwargs) #------------------------------------------------------------------------- # Private `object` interface #------------------------------------------------------------------------- def __repr__(self): return '%s(cachedir=%s)' % ( self.__class__.__name__, repr(self.cachedir), ) def __reduce__(self): """ We don't store the timestamp when pickling, to avoid the hash depending from it. In addition, when unpickling, we run the __init__ """ # We need to remove 'joblib' from the end of cachedir cachedir = self.cachedir[:-7] if self.cachedir is not None else None return (self.__class__, (cachedir, self.mmap_mode, self.compress, self._verbose)) joblib-0.9.4/joblib/my_exceptions.py000066400000000000000000000071521264716474700175150ustar00rootroot00000000000000""" Exceptions """ # Author: Gael Varoquaux < gael dot varoquaux at normalesup dot org > # Copyright: 2010, Gael Varoquaux # License: BSD 3 clause import sys from ._compat import PY3_OR_LATER class JoblibException(Exception): """A simple exception with an error message that you can get to.""" def __init__(self, *args): # We need to implement __init__ so that it is picked in the # multiple heritance hierarchy in the class created in # _mk_exception. Note: in Python 2, if you implement __init__ # in your exception class you need to set .args correctly, # otherwise you can dump an exception instance with pickle but # not load it (at load time an empty .args will be passed to # the constructor). Also we want to be explicit and not use # 'super' here. Using 'super' can cause a sibling class method # to be called and we have no control the sibling class method # constructor signature in the exception returned by # _mk_exception. Exception.__init__(self, *args) def __repr__(self): if hasattr(self, 'args') and len(self.args) > 0: message = self.args[0] else: message = '' name = self.__class__.__name__ return '%s\n%s\n%s\n%s' % (name, 75 * '_', message, 75 * '_') __str__ = __repr__ class TransportableException(JoblibException): """An exception containing all the info to wrap an original exception and recreate it. """ def __init__(self, message, etype): # The next line set the .args correctly. This is needed to # make the exception loadable with pickle JoblibException.__init__(self, message, etype) self.message = message self.etype = etype _exception_mapping = dict() def _mk_exception(exception, name=None): # Create an exception inheriting from both JoblibException # and that exception if name is None: name = exception.__name__ this_name = 'Joblib%s' % name if this_name in _exception_mapping: # Avoid creating twice the same exception this_exception = _exception_mapping[this_name] else: if exception is Exception: # JoblibException is already a subclass of Exception. No # need to use multiple inheritance return JoblibException, this_name try: this_exception = type( this_name, (JoblibException, exception), {}) _exception_mapping[this_name] = this_exception except TypeError: # This happens if "Cannot create a consistent method # resolution order", e.g. 
because 'exception' is a # subclass of JoblibException or 'exception' is not an # acceptable base class this_exception = JoblibException return this_exception, this_name def _mk_common_exceptions(): namespace = dict() if PY3_OR_LATER: import builtins as _builtin_exceptions common_exceptions = filter( lambda x: x.endswith('Error'), dir(_builtin_exceptions)) else: import exceptions as _builtin_exceptions common_exceptions = dir(_builtin_exceptions) for name in common_exceptions: obj = getattr(_builtin_exceptions, name) if isinstance(obj, type) and issubclass(obj, BaseException): this_obj, this_name = _mk_exception(obj, name=name) namespace[this_name] = this_obj return namespace # Updating module locals so that the exceptions pickle right. AFAIK this # works only at module-creation time locals().update(_mk_common_exceptions()) joblib-0.9.4/joblib/numpy_pickle.py000066400000000000000000000420131264716474700173210ustar00rootroot00000000000000""" Utilities for fast persistence of big data, with optional compression. """ # Author: Gael Varoquaux # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. import pickle import traceback import os import zlib import warnings from io import BytesIO from ._compat import _basestring, PY3_OR_LATER if PY3_OR_LATER: Unpickler = pickle._Unpickler Pickler = pickle._Pickler def asbytes(s): if isinstance(s, bytes): return s return s.encode('latin1') else: Unpickler = pickle.Unpickler Pickler = pickle.Pickler asbytes = str def hex_str(an_int): """Converts an int to an hexadecimal string """ return '{0:#x}'.format(an_int) _MEGA = 2 ** 20 # Compressed pickle header format: _ZFILE_PREFIX followed by _MAX_LEN # bytes which contains the length of the zlib compressed data as an # hexadecimal string. For example: 'ZF0x139 ' _ZFILE_PREFIX = asbytes('ZF') _MAX_LEN = len(hex_str(2 ** 64)) ############################################################################### # Compressed file with Zlib def _read_magic(file_handle): """ Utility to check the magic signature of a file identifying it as a Zfile """ magic = file_handle.read(len(_ZFILE_PREFIX)) # Pickling needs file-handles at the beginning of the file file_handle.seek(0) return magic def read_zfile(file_handle): """Read the z-file and return the content as a string Z-files are raw data compressed with zlib used internally by joblib for persistence. Backward compatibility is not guaranteed. Do not use for external purposes. """ file_handle.seek(0) assert _read_magic(file_handle) == _ZFILE_PREFIX, \ "File does not have the right magic" header_length = len(_ZFILE_PREFIX) + _MAX_LEN length = file_handle.read(header_length) length = length[len(_ZFILE_PREFIX):] length = int(length, 16) # With python2 and joblib version <= 0.8.4 compressed pickle header is one # character wider so we need to ignore an additional space if present. # Note: the first byte of the zlib data is guaranteed not to be a # space according to # https://tools.ietf.org/html/rfc6713#section-2.1 next_byte = file_handle.read(1) if next_byte != b' ': # The zlib compressed data has started and we need to go back # one byte file_handle.seek(header_length) # We use the known length of the data to tell Zlib the size of the # buffer to allocate. data = zlib.decompress(file_handle.read(), 15, length) assert len(data) == length, ( "Incorrect data length while decompressing %s." "The file could be corrupted." % file_handle) return data def write_zfile(file_handle, data, compress=1): """Write the data in the given file as a Z-file. 
Z-files are raw data compressed with zlib used internally by joblib for persistence. Backward compatibility is not guarantied. Do not use for external purposes. """ file_handle.write(_ZFILE_PREFIX) length = hex_str(len(data)) # Store the length of the data file_handle.write(asbytes(length.ljust(_MAX_LEN))) file_handle.write(zlib.compress(asbytes(data), compress)) ############################################################################### # Utility objects for persistence. class NDArrayWrapper(object): """ An object to be persisted instead of numpy arrays. The only thing this object does, is to carry the filename in which the array has been persisted, and the array subclass. """ def __init__(self, filename, subclass, allow_mmap=True): "Store the useful information for later" self.filename = filename self.subclass = subclass self.allow_mmap = allow_mmap def read(self, unpickler): "Reconstruct the array" filename = os.path.join(unpickler._dirname, self.filename) # Load the array from the disk np_ver = [int(x) for x in unpickler.np.__version__.split('.', 2)[:2]] # use getattr instead of self.allow_mmap to ensure backward compat # with NDArrayWrapper instances pickled with joblib < 0.9.0 allow_mmap = getattr(self, 'allow_mmap', True) memmap_kwargs = ({} if not allow_mmap else {'mmap_mode': unpickler.mmap_mode}) array = unpickler.np.load(filename, **memmap_kwargs) # Reconstruct subclasses. This does not work with old # versions of numpy if (hasattr(array, '__array_prepare__') and not self.subclass in (unpickler.np.ndarray, unpickler.np.memmap)): # We need to reconstruct another subclass new_array = unpickler.np.core.multiarray._reconstruct( self.subclass, (0,), 'b') new_array.__array_prepare__(array) array = new_array return array #def __reduce__(self): # return None class ZNDArrayWrapper(NDArrayWrapper): """An object to be persisted instead of numpy arrays. This object store the Zfile filename in which the data array has been persisted, and the meta information to retrieve it. The reason that we store the raw buffer data of the array and the meta information, rather than array representation routine (tostring) is that it enables us to use completely the strided model to avoid memory copies (a and a.T store as fast). In addition saving the heavy information separately can avoid creating large temporary buffers when unpickling data with large arrays. """ def __init__(self, filename, init_args, state): "Store the useful information for later" self.filename = filename self.state = state self.init_args = init_args def read(self, unpickler): "Reconstruct the array from the meta-information and the z-file" # Here we a simply reproducing the unpickling mechanism for numpy # arrays filename = os.path.join(unpickler._dirname, self.filename) array = unpickler.np.core.multiarray._reconstruct(*self.init_args) with open(filename, 'rb') as f: data = read_zfile(f) state = self.state + (data,) array.__setstate__(state) return array ############################################################################### # Pickler classes class NumpyPickler(Pickler): """A pickler to persist of big data efficiently. The main features of this object are: * persistence of numpy arrays in separate .npy files, for which I/O is fast. * optional compression using Zlib, with a special care on avoid temporaries. 
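
    NumpyPickler is normally driven through the module level ``dump`` helper
    defined below rather than instantiated directly. A minimal usage sketch
    (the target path is illustrative only)::

        >>> import numpy as np                                  #doctest: +SKIP
        >>> from joblib.numpy_pickle import dump, load          #doctest: +SKIP
        >>> dump({'w': np.ones((1000, 10))}, '/tmp/model.pkl')  #doctest: +SKIP
        ['/tmp/model.pkl', '/tmp/model.pkl_01.npy']
        >>> load('/tmp/model.pkl')['w'].shape                   #doctest: +SKIP
        (1000, 10)

    The returned list names the main pickle plus one companion ``.npy`` file
    per array stored out of band.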
""" dispatch = Pickler.dispatch.copy() def __init__(self, filename, compress=0, cache_size=10, protocol=None): self._filename = filename self._filenames = [filename, ] self.cache_size = cache_size self.compress = compress if not self.compress: self.file = open(filename, 'wb') else: self.file = BytesIO() # Count the number of npy files that we have created: self._npy_counter = 1 # By default we want a pickle protocol that only changes with # the major python version and not the minor one if protocol is None: protocol = (pickle.DEFAULT_PROTOCOL if PY3_OR_LATER else pickle.HIGHEST_PROTOCOL) Pickler.__init__(self, self.file, protocol=protocol) # delayed import of numpy, to avoid tight coupling try: import numpy as np except ImportError: np = None self.np = np def _write_array(self, array, filename): if not self.compress: self.np.save(filename, array) allow_mmap = not array.dtype.hasobject container = NDArrayWrapper(os.path.basename(filename), type(array), allow_mmap=allow_mmap) else: filename += '.z' # Efficient compressed storage: # The meta data is stored in the container, and the core # numerics in a z-file _, init_args, state = array.__reduce__() # the last entry of 'state' is the data itself with open(filename, 'wb') as zfile: write_zfile(zfile, state[-1], compress=self.compress) state = state[:-1] container = ZNDArrayWrapper(os.path.basename(filename), init_args, state) return container, filename def save(self, obj): """ Subclass the save method, to save ndarray subclasses in npy files, rather than pickling them. Of course, this is a total abuse of the Pickler class. """ if (self.np is not None and type(obj) in (self.np.ndarray, self.np.matrix, self.np.memmap)): size = obj.size * obj.itemsize if self.compress and size < self.cache_size * _MEGA: # When compressing, as we are not writing directly to the # disk, it is more efficient to use standard pickling if type(obj) is self.np.memmap: # Pickling doesn't work with memmaped arrays obj = self.np.asarray(obj) return Pickler.save(self, obj) if not obj.dtype.hasobject: try: filename = '%s_%02i.npy' % (self._filename, self._npy_counter) # This converts the array in a container obj, filename = self._write_array(obj, filename) self._filenames.append(filename) self._npy_counter += 1 except Exception: # XXX: We should have a logging mechanism print('Failed to save %s to .npy file:\n%s' % ( type(obj), traceback.format_exc())) return Pickler.save(self, obj) def close(self): if self.compress: with open(self._filename, 'wb') as zfile: write_zfile(zfile, self.file.getvalue(), self.compress) class NumpyUnpickler(Unpickler): """A subclass of the Unpickler to unpickle our numpy pickles. """ dispatch = Unpickler.dispatch.copy() def __init__(self, filename, file_handle, mmap_mode=None): self._filename = os.path.basename(filename) self._dirname = os.path.dirname(filename) self.mmap_mode = mmap_mode self.file_handle = self._open_pickle(file_handle) Unpickler.__init__(self, self.file_handle) try: import numpy as np except ImportError: np = None self.np = np def _open_pickle(self, file_handle): return file_handle def load_build(self): """ This method is called to set the state of a newly created object. We capture it to replace our place-holder objects, NDArrayWrapper, by the array we are interested in. We replace them directly in the stack of pickler. 
""" Unpickler.load_build(self) if isinstance(self.stack[-1], NDArrayWrapper): if self.np is None: raise ImportError('Trying to unpickle an ndarray, ' "but numpy didn't import correctly") nd_array_wrapper = self.stack.pop() array = nd_array_wrapper.read(self) self.stack.append(array) # Be careful to register our new method. if PY3_OR_LATER: dispatch[pickle.BUILD[0]] = load_build else: dispatch[pickle.BUILD] = load_build class ZipNumpyUnpickler(NumpyUnpickler): """A subclass of our Unpickler to unpickle on the fly from compressed storage.""" def __init__(self, filename, file_handle): NumpyUnpickler.__init__(self, filename, file_handle, mmap_mode=None) def _open_pickle(self, file_handle): return BytesIO(read_zfile(file_handle)) ############################################################################### # Utility functions def dump(value, filename, compress=0, cache_size=100, protocol=None): """Fast persistence of an arbitrary Python object into one or multiple files, with dedicated storage for numpy arrays. Parameters ----------- value: any Python object The object to store to disk filename: string The name of the file in which it is to be stored compress: integer for 0 to 9, optional Optional compression level for the data. 0 is no compression. Higher means more compression, but also slower read and write times. Using a value of 3 is often a good compromise. See the notes for more details. cache_size: positive number, optional Fixes the order of magnitude (in megabytes) of the cache used for in-memory compression. Note that this is just an order of magnitude estimate and that for big arrays, the code will go over this value at dump and at load time. protocol: positive int Pickle protocol, see pickle.dump documentation for more details. Returns ------- filenames: list of strings The list of file names in which the data is stored. If compress is false, each array is stored in a different file. See Also -------- joblib.load : corresponding loader Notes ----- Memmapping on load cannot be used for compressed files. Thus using compression can significantly slow down loading. In addition, compressed files take extra extra memory during dump and load. """ if compress is True: # By default, if compress is enabled, we want to be using 3 by # default compress = 3 if not isinstance(filename, _basestring): # People keep inverting arguments, and the resulting error is # incomprehensible raise ValueError( 'Second argument should be a filename, %s (type %s) was given' % (filename, type(filename)) ) try: pickler = NumpyPickler(filename, compress=compress, cache_size=cache_size, protocol=protocol) pickler.dump(value) pickler.close() finally: if 'pickler' in locals() and hasattr(pickler, 'file'): pickler.file.flush() pickler.file.close() return pickler._filenames def load(filename, mmap_mode=None): """Reconstruct a Python object from a file persisted with joblib.dump. Parameters ----------- filename: string The name of the file from which to load the object mmap_mode: {None, 'r+', 'r', 'w+', 'c'}, optional If not None, the arrays are memory-mapped from the disk. This mode has no effect for compressed files. Note that in this case the reconstructed object might not longer match exactly the originally pickled object. Returns ------- result: any Python object The object stored in the file. See Also -------- joblib.dump : function to save an object Notes ----- This function can load numpy array files saved separately during the dump. 
If the mmap_mode argument is given, it is passed to np.load and arrays are loaded as memmaps. As a consequence, the reconstructed object might not match the original pickled object. Note that if the file was saved with compression, the arrays cannot be memmaped. """ with open(filename, 'rb') as file_handle: # We are careful to open the file handle early and keep it open to # avoid race-conditions on renames. That said, if data are stored in # companion files, moving the directory will create a race when # joblib tries to access the companion files. if _read_magic(file_handle) == _ZFILE_PREFIX: if mmap_mode is not None: warnings.warn('file "%(filename)s" appears to be a zip, ' 'ignoring mmap_mode "%(mmap_mode)s" flag passed' % locals(), Warning, stacklevel=2) unpickler = ZipNumpyUnpickler(filename, file_handle=file_handle) else: unpickler = NumpyUnpickler(filename, file_handle=file_handle, mmap_mode=mmap_mode) try: obj = unpickler.load() except UnicodeDecodeError as exc: # More user-friendly error message if PY3_OR_LATER: new_exc = ValueError( 'You may be trying to read with ' 'python 3 a joblib pickle generated with python 2. ' 'This feature is not supported by joblib.') new_exc.__cause__ = exc raise new_exc finally: if hasattr(unpickler, 'file_handle'): unpickler.file_handle.close() return obj joblib-0.9.4/joblib/parallel.py000066400000000000000000001053671264716474700164320ustar00rootroot00000000000000""" Helpers for embarrassingly parallel code. """ # Author: Gael Varoquaux < gael dot varoquaux at normalesup dot org > # Copyright: 2010, Gael Varoquaux # License: BSD 3 clause from __future__ import division import os import sys import gc import warnings from math import sqrt import functools import time import threading import itertools from numbers import Integral try: import cPickle as pickle except: import pickle from ._multiprocessing_helpers import mp if mp is not None: from .pool import MemmapingPool from multiprocessing.pool import ThreadPool from .format_stack import format_exc, format_outer_frames from .logger import Logger, short_format_time from .my_exceptions import TransportableException, _mk_exception from .disk import memstr_to_kbytes from ._compat import _basestring VALID_BACKENDS = ['multiprocessing', 'threading'] # Environment variables to protect against bad situations when nesting JOBLIB_SPAWNED_PROCESS = "__JOBLIB_SPAWNED_PARALLEL__" # In seconds, should be big enough to hide multiprocessing dispatching # overhead. # This settings was found by running benchmarks/bench_auto_batching.py # with various parameters on various platforms. MIN_IDEAL_BATCH_DURATION = .2 # Should not be too high to avoid stragglers: long jobs running alone # on a single worker while other workers have no work to process any more. MAX_IDEAL_BATCH_DURATION = 2 # Under Linux or OS X the default start method of multiprocessing # can cause third party libraries to crash. Under Python 3.4+ it is possible # to set an environment variable to switch the default start method from # 'fork' to 'forkserver' or 'spawn' to avoid this issue albeit at the cost # of causing semantic changes and some additional pool instanciation overhead. 
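# For instance, a program running into such crashes could opt into the
# 'forkserver' method before joblib is first imported (illustrative sketch;
# only the JOBLIB_START_METHOD variable below comes from this module, the
# rest is made up for the example):
#
#     import os
#     os.environ['JOBLIB_START_METHOD'] = 'forkserver'
#     from joblib import Parallel, delayed
#     Parallel(n_jobs=2)(delayed(abs)(-i) for i in range(8))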
if hasattr(mp, 'get_context'): method = os.environ.get('JOBLIB_START_METHOD', '').strip() or None DEFAULT_MP_CONTEXT = mp.get_context(method=method) else: DEFAULT_MP_CONTEXT = None class BatchedCalls(object): """Wrap a sequence of (func, args, kwargs) tuples as a single callable""" def __init__(self, iterator_slice): self.items = list(iterator_slice) self._size = len(self.items) def __call__(self): return [func(*args, **kwargs) for func, args, kwargs in self.items] def __len__(self): return self._size ############################################################################### # CPU count that works also when multiprocessing has been disabled via # the JOBLIB_MULTIPROCESSING environment variable def cpu_count(): """ Return the number of CPUs. """ if mp is None: return 1 return mp.cpu_count() ############################################################################### # For verbosity def _verbosity_filter(index, verbose): """ Returns False for indices increasingly apart, the distance depending on the value of verbose. We use a lag increasing as the square of index """ if not verbose: return True elif verbose > 10: return False if index == 0: return False verbose = .5 * (11 - verbose) ** 2 scale = sqrt(index / verbose) next_scale = sqrt((index + 1) / verbose) return (int(next_scale) == int(scale)) ############################################################################### class WorkerInterrupt(Exception): """ An exception that is not KeyboardInterrupt to allow subprocesses to be interrupted. """ pass ############################################################################### class SafeFunction(object): """ Wraps a function to make it exception with full traceback in their representation. Useful for parallel computing with multiprocessing, for which exceptions cannot be captured. """ def __init__(self, func): self.func = func def __call__(self, *args, **kwargs): try: return self.func(*args, **kwargs) except KeyboardInterrupt: # We capture the KeyboardInterrupt and reraise it as # something different, as multiprocessing does not # interrupt processing for a KeyboardInterrupt raise WorkerInterrupt() except: e_type, e_value, e_tb = sys.exc_info() text = format_exc(e_type, e_value, e_tb, context=10, tb_offset=1) raise TransportableException(text, e_type) ############################################################################### def delayed(function, check_pickle=True): """Decorator used to capture the arguments of a function. Pass `check_pickle=False` when: - performing a possibly repeated check is too costly and has been done already once outside of the call to delayed. - when used in conjunction `Parallel(backend='threading')`. """ # Try to pickle the input function, to catch the problems early when # using with multiprocessing: if check_pickle: pickle.dumps(function) def delayed_function(*args, **kwargs): return function, args, kwargs try: delayed_function = functools.wraps(function)(delayed_function) except AttributeError: " functools.wraps fails on some callable objects " return delayed_function ############################################################################### class ImmediateComputeBatch(object): """Sequential computation of a batch of tasks. This replicates the async computation API but actually does not delay the computations when joblib.Parallel runs in sequential mode. 
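
    A minimal sketch of the interface it mimics, using the ``BatchedCalls``
    helper defined earlier in this module::

        >>> batch = BatchedCalls([(abs, (-3,), {}), (abs, (4,), {})])
        >>> ImmediateComputeBatch(batch).get()
        [3, 4]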
""" def __init__(self, batch): # Don't delay the application, to avoid keeping the input # arguments in memory self.results = batch() def get(self): return self.results ############################################################################### class BatchCompletionCallBack(object): """Callback used by joblib.Parallel's multiprocessing backend. This callable is executed by the parent process whenever a worker process has returned the results of a batch of tasks. It is used for progress reporting, to update estimate of the batch processing duration and to schedule the next batch of tasks to be processed. """ def __init__(self, dispatch_timestamp, batch_size, parallel): self.dispatch_timestamp = dispatch_timestamp self.batch_size = batch_size self.parallel = parallel def __call__(self, out): self.parallel.n_completed_tasks += self.batch_size this_batch_duration = time.time() - self.dispatch_timestamp if (self.parallel.batch_size == 'auto' and self.batch_size == self.parallel._effective_batch_size): # Update the smoothed streaming estimate of the duration of a batch # from dispatch to completion old_duration = self.parallel._smoothed_batch_duration if old_duration == 0: # First record of duration for this batch size after the last # reset. new_duration = this_batch_duration else: # Update the exponentially weighted average of the duration of # batch for the current effective size. new_duration = 0.8 * old_duration + 0.2 * this_batch_duration self.parallel._smoothed_batch_duration = new_duration self.parallel.print_progress() if self.parallel._original_iterator is not None: self.parallel.dispatch_next() ############################################################################### class Parallel(Logger): ''' Helper class for readable parallel mapping. Parameters ----------- n_jobs: int, default: 1 The maximum number of concurrently running jobs, such as the number of Python worker processes when backend="multiprocessing" or the size of the thread-pool when backend="threading". If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. backend: str or None, default: 'multiprocessing' Specify the parallelization backend implementation. Supported backends are: - "multiprocessing" used by default, can induce some communication and memory overhead when exchanging input and output data with the with the worker Python processes. - "threading" is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. "threading" is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a "with nogil" block or an expensive call to a library such as NumPy). verbose: int, optional The verbosity level: if non zero, progress messages are printed. Above 50, the output is sent to stdout. The frequency of the messages increases with the verbosity level. If it more than 10, all iterations are reported. pre_dispatch: {'all', integer, or expression, as in '3*n_jobs'} The number of batches (of tasks) to be pre-dispatched. Default is '2*n_jobs'. When batch_size="auto" this is reasonable default and the multiprocessing workers shoud never starve. batch_size: int or 'auto', default: 'auto' The number of atomic tasks to dispatch at once to each worker. 
When individual evaluations are very fast, multiprocessing can be slower than sequential computation because of the overhead. Batching fast computations together can mitigate this. The ``'auto'`` strategy keeps track of the time it takes for a batch to complete, and dynamically adjusts the batch size to keep the time on the order of half a second, using a heuristic. The initial batch size is 1. ``batch_size="auto"`` with ``backend="threading"`` will dispatch batches of a single task at a time as the threading backend has very little overhead and using larger batch size has not proved to bring any gain in that case. temp_folder: str, optional Folder to be used by the pool for memmaping large arrays for sharing memory with worker processes. If None, this will try in order: - a folder pointed by the JOBLIB_TEMP_FOLDER environment variable, - /dev/shm if the folder exists and is writable: this is a RAMdisk filesystem available by default on modern Linux distributions, - the default system temporary folder that can be overridden with TMP, TMPDIR or TEMP environment variables, typically /tmp under Unix operating systems. Only active when backend="multiprocessing". max_nbytes int, str, or None, optional, 1M by default Threshold on the size of arrays passed to the workers that triggers automated memory mapping in temp_folder. Can be an int in Bytes, or a human-readable string, e.g., '1M' for 1 megabyte. Use None to disable memmaping of large arrays. Only active when backend="multiprocessing". Notes ----- This object uses the multiprocessing module to compute in parallel the application of a function to many different arguments. The main functionality it brings in addition to using the raw multiprocessing API are (see examples for details): * More readable code, in particular since it avoids constructing list of arguments. * Easier debugging: - informative tracebacks even when the error happens on the client side - using 'n_jobs=1' enables to turn off parallel computing for debugging without changing the codepath - early capture of pickling errors * An optional progress meter. * Interruption of multiprocesses jobs with 'Ctrl-C' * Flexible pickling control for the communication to and from the worker processes. * Ability to use shared memory efficiently with worker processes for large numpy-based datastructures. Examples -------- A simple example: >>> from math import sqrt >>> from joblib import Parallel, delayed >>> Parallel(n_jobs=1)(delayed(sqrt)(i**2) for i in range(10)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0] Reshaping the output when the function has several return values: >>> from math import modf >>> from joblib import Parallel, delayed >>> r = Parallel(n_jobs=1)(delayed(modf)(i/2.) 
for i in range(10)) >>> res, i = zip(*r) >>> res (0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5) >>> i (0.0, 0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0) The progress meter: the higher the value of `verbose`, the more messages:: >>> from time import sleep >>> from joblib import Parallel, delayed >>> r = Parallel(n_jobs=2, verbose=5)(delayed(sleep)(.1) for _ in range(10)) #doctest: +SKIP [Parallel(n_jobs=2)]: Done 1 out of 10 | elapsed: 0.1s remaining: 0.9s [Parallel(n_jobs=2)]: Done 3 out of 10 | elapsed: 0.2s remaining: 0.5s [Parallel(n_jobs=2)]: Done 6 out of 10 | elapsed: 0.3s remaining: 0.2s [Parallel(n_jobs=2)]: Done 9 out of 10 | elapsed: 0.5s remaining: 0.1s [Parallel(n_jobs=2)]: Done 10 out of 10 | elapsed: 0.5s finished Traceback example, note how the line of the error is indicated as well as the values of the parameter passed to the function that triggered the exception, even though the traceback happens in the child process:: >>> from heapq import nlargest >>> from joblib import Parallel, delayed >>> Parallel(n_jobs=2)(delayed(nlargest)(2, n) for n in (range(4), 'abcde', 3)) #doctest: +SKIP #... --------------------------------------------------------------------------- Sub-process traceback: --------------------------------------------------------------------------- TypeError Mon Nov 12 11:37:46 2012 PID: 12934 Python 2.7.3: /usr/bin/python ........................................................................... /usr/lib/python2.7/heapq.pyc in nlargest(n=2, iterable=3, key=None) 419 if n >= size: 420 return sorted(iterable, key=key, reverse=True)[:n] 421 422 # When key is none, use simpler decoration 423 if key is None: --> 424 it = izip(iterable, count(0,-1)) # decorate 425 result = _nlargest(n, it) 426 return map(itemgetter(0), result) # undecorate 427 428 # General case, slowest method TypeError: izip argument #1 must support iteration ___________________________________________________________________________ Using pre_dispatch in a producer/consumer situation, where the data is generated on the fly. Note how the producer is first called a 3 times before the parallel loop is initiated, and then called to generate new data on the fly. In this case the total number of iterations cannot be reported in the progress messages:: >>> from math import sqrt >>> from joblib import Parallel, delayed >>> def producer(): ... for i in range(6): ... print('Produced %s' % i) ... yield i >>> out = Parallel(n_jobs=2, verbose=100, pre_dispatch='1.5*n_jobs')( ... delayed(sqrt)(i) for i in producer()) #doctest: +SKIP Produced 0 Produced 1 Produced 2 [Parallel(n_jobs=2)]: Done 1 jobs | elapsed: 0.0s Produced 3 [Parallel(n_jobs=2)]: Done 2 jobs | elapsed: 0.0s Produced 4 [Parallel(n_jobs=2)]: Done 3 jobs | elapsed: 0.0s Produced 5 [Parallel(n_jobs=2)]: Done 4 jobs | elapsed: 0.0s [Parallel(n_jobs=2)]: Done 5 out of 6 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=2)]: Done 6 out of 6 | elapsed: 0.0s finished ''' def __init__(self, n_jobs=1, backend='multiprocessing', verbose=0, pre_dispatch='2 * n_jobs', batch_size='auto', temp_folder=None, max_nbytes='1M', mmap_mode='r'): self.verbose = verbose self._mp_context = DEFAULT_MP_CONTEXT if backend is None: # `backend=None` was supported in 0.8.2 with this effect backend = "multiprocessing" elif hasattr(backend, 'Pool') and hasattr(backend, 'Lock'): # Make it possible to pass a custom multiprocessing context as # backend to change the start method to forkserver or spawn or # preload modules on the forkserver helper process. 
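            # For example (illustrative sketch, Python 3.4+ only; the
            # context name and the toy workload are made up):
            #
            #     import multiprocessing as mp
            #     from math import sqrt
            #     ctx = mp.get_context('forkserver')
            #     Parallel(n_jobs=2, backend=ctx)(
            #         delayed(sqrt)(i ** 2) for i in range(10))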
self._mp_context = backend backend = "multiprocessing" if backend not in VALID_BACKENDS: raise ValueError("Invalid backend: %s, expected one of %r" % (backend, VALID_BACKENDS)) self.backend = backend self.n_jobs = n_jobs if (batch_size == 'auto' or isinstance(batch_size, Integral) and batch_size > 0): self.batch_size = batch_size else: raise ValueError( "batch_size must be 'auto' or a positive integer, got: %r" % batch_size) self.pre_dispatch = pre_dispatch self._temp_folder = temp_folder if isinstance(max_nbytes, _basestring): self._max_nbytes = 1024 * memstr_to_kbytes(max_nbytes) else: self._max_nbytes = max_nbytes self._mmap_mode = mmap_mode # Not starting the pool in the __init__ is a design decision, to be # able to close it ASAP, and not burden the user with closing it # unless they choose to use the context manager API with a with block. self._pool = None self._output = None self._jobs = list() self._managed_pool = False # This lock is used coordinate the main thread of this process with # the async callback thread of our the pool. self._lock = threading.Lock() def __enter__(self): self._managed_pool = True self._initialize_pool() return self def __exit__(self, exc_type, exc_value, traceback): self._terminate_pool() self._managed_pool = False def _effective_n_jobs(self): n_jobs = self.n_jobs if n_jobs == 0: raise ValueError('n_jobs == 0 in Parallel has no meaning') elif mp is None or n_jobs is None: # multiprocessing is not available or disabled, fallback # to sequential mode return 1 elif n_jobs < 0: n_jobs = max(mp.cpu_count() + 1 + n_jobs, 1) return n_jobs def _initialize_pool(self): """Build a process or thread pool and return the number of workers""" n_jobs = self._effective_n_jobs() # The list of exceptions that we will capture self.exceptions = [TransportableException] if n_jobs == 1: # Sequential mode: do not use a pool instance to avoid any # useless dispatching overhead self._pool = None elif self.backend == 'threading': self._pool = ThreadPool(n_jobs) elif self.backend == 'multiprocessing': if mp.current_process().daemon: # Daemonic processes cannot have children self._pool = None warnings.warn( 'Multiprocessing-backed parallel loops cannot be nested,' ' setting n_jobs=1', stacklevel=3) return 1 elif threading.current_thread().name != 'MainThread': # Prevent posix fork inside in non-main posix threads self._pool = None warnings.warn( 'Multiprocessing backed parallel loops cannot be nested' ' below threads, setting n_jobs=1', stacklevel=3) return 1 else: already_forked = int(os.environ.get(JOBLIB_SPAWNED_PROCESS, 0)) if already_forked: raise ImportError('[joblib] Attempting to do parallel computing ' 'without protecting your import on a system that does ' 'not support forking. To use parallel-computing in a ' 'script, you must protect your main loop using "if ' "__name__ == '__main__'" '". 
Please see the joblib documentation on Parallel ' 'for more information' ) # Set an environment variable to avoid infinite loops os.environ[JOBLIB_SPAWNED_PROCESS] = '1' # Make sure to free as much memory as possible before forking gc.collect() poolargs = dict( max_nbytes=self._max_nbytes, mmap_mode=self._mmap_mode, temp_folder=self._temp_folder, verbose=max(0, self.verbose - 50), ) if self._mp_context is not None: # Use Python 3.4+ multiprocessing context isolation poolargs['context'] = self._mp_context self._pool = MemmapingPool(n_jobs, **poolargs) # We are using multiprocessing, we also want to capture # KeyboardInterrupts self.exceptions.extend([KeyboardInterrupt, WorkerInterrupt]) else: raise ValueError("Unsupported backend: %s" % self.backend) return n_jobs def _terminate_pool(self): if self._pool is not None: self._pool.close() self._pool.terminate() # terminate does a join() self._pool = None if self.backend == 'multiprocessing': os.environ.pop(JOBLIB_SPAWNED_PROCESS, 0) def _dispatch(self, batch): """Queue the batch for computing, with or without multiprocessing WARNING: this method is not thread-safe: it should be only called indirectly via dispatch_one_batch. """ # If job.get() catches an exception, it closes the queue: if self._aborting: return if self._pool is None: job = ImmediateComputeBatch(batch) self._jobs.append(job) self.n_dispatched_batches += 1 self.n_dispatched_tasks += len(batch) self.n_completed_tasks += len(batch) if not _verbosity_filter(self.n_dispatched_batches, self.verbose): self._print('Done %3i tasks | elapsed: %s', (self.n_completed_tasks, short_format_time(time.time() - self._start_time) )) else: dispatch_timestamp = time.time() cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self) job = self._pool.apply_async(SafeFunction(batch), callback=cb) self._jobs.append(job) self.n_dispatched_tasks += len(batch) self.n_dispatched_batches += 1 def dispatch_next(self): """Dispatch more data for parallel processing This method is meant to be called concurrently by the multiprocessing callback. We rely on the thread-safety of dispatch_one_batch to protect against concurrent consumption of the unprotected iterator. """ if not self.dispatch_one_batch(self._original_iterator): self._iterating = False self._original_iterator = None def dispatch_one_batch(self, iterator): """Prefetch the tasks for the next batch and dispatch them. The effective size of the batch is computed here. If there are no more jobs to dispatch, return False, else return True. The iterator consumption and dispatching is protected by the same lock so calling this function should be thread safe. """ if self.batch_size == 'auto' and self.backend == 'threading': # Batching is never beneficial with the threading backend batch_size = 1 elif self.batch_size == 'auto': old_batch_size = self._effective_batch_size batch_duration = self._smoothed_batch_duration if (batch_duration > 0 and batch_duration < MIN_IDEAL_BATCH_DURATION): # The current batch size is too small: the duration of the # processing of a batch of task is not large enough to hide # the scheduling overhead. ideal_batch_size = int( old_batch_size * MIN_IDEAL_BATCH_DURATION / batch_duration) # Multiply by two to limit oscilations between min and max. batch_size = max(2 * ideal_batch_size, 1) self._effective_batch_size = batch_size if self.verbose >= 10: self._print("Batch computation too fast (%.4fs.) 
" "Setting batch_size=%d.", ( batch_duration, batch_size)) elif (batch_duration > MAX_IDEAL_BATCH_DURATION and old_batch_size >= 2): # The current batch size is too big. If we schedule overly long # running batches some CPUs might wait with nothing left to do # while a couple of CPUs a left processing a few long running # batches. Better reduce the batch size a bit to limit the # likelihood of scheduling such stragglers. self._effective_batch_size = batch_size = old_batch_size // 2 if self.verbose >= 10: self._print("Batch computation too slow (%.2fs.) " "Setting batch_size=%d.", ( batch_duration, batch_size)) else: # No batch size adjustment batch_size = old_batch_size if batch_size != old_batch_size: # Reset estimation of the smoothed mean batch duration: this # estimate is updated in the multiprocessing apply_async # CallBack as long as the batch_size is constant. Therefore # we need to reset the estimate whenever we re-tune the batch # size. self._smoothed_batch_duration = 0 else: # Fixed batch size strategy batch_size = self.batch_size with self._lock: tasks = BatchedCalls(itertools.islice(iterator, batch_size)) if not tasks: # No more tasks available in the iterator: tell caller to stop. return False else: self._dispatch(tasks) return True def _print(self, msg, msg_args): """Display the message on stout or stderr depending on verbosity""" # XXX: Not using the logger framework: need to # learn to use logger better. if not self.verbose: return if self.verbose < 50: writer = sys.stderr.write else: writer = sys.stdout.write msg = msg % msg_args writer('[%s]: %s\n' % (self, msg)) def print_progress(self): """Display the process of the parallel execution only a fraction of time, controlled by self.verbose. """ if not self.verbose: return elapsed_time = time.time() - self._start_time # This is heuristic code to print only 'verbose' times a messages # The challenge is that we may not know the queue length if self._original_iterator: if _verbosity_filter(self.n_dispatched_batches, self.verbose): return self._print('Done %3i tasks | elapsed: %s', (self.n_completed_tasks, short_format_time(elapsed_time), )) else: index = self.n_dispatched_batches # We are finished dispatching total_tasks = self.n_dispatched_tasks # We always display the first loop if not index == 0: # Display depending on the number of remaining items # A message as soon as we finish dispatching, cursor is 0 cursor = (total_tasks - index + 1 - self._pre_dispatch_amount) frequency = (total_tasks // self.verbose) + 1 is_last_item = (index + 1 == total_tasks) if (is_last_item or cursor % frequency): return remaining_time = (elapsed_time / (index + 1) * (self.n_dispatched_tasks - index - 1.)) self._print('Done %3i out of %3i | elapsed: %s remaining: %s', (index + 1, total_tasks, short_format_time(elapsed_time), short_format_time(remaining_time), )) def retrieve(self): self._output = list() while self._iterating or len(self._jobs) > 0: if len(self._jobs) == 0: # Wait for an async callback to dispatch new jobs time.sleep(0.01) continue # We need to be careful: the job list can be filling up as # we empty it and Python list are not thread-safe by default hence # the use of the lock with self._lock: job = self._jobs.pop(0) try: self._output.extend(job.get()) except tuple(self.exceptions) as exception: # Stop dispatching any new job in the async callback thread self._aborting = True if isinstance(exception, TransportableException): # Capture exception to add information on the local # stack in addition to the distant stack this_report = 
format_outer_frames(context=10, stack_start=1) report = """Multiprocessing exception: %s --------------------------------------------------------------------------- Sub-process traceback: --------------------------------------------------------------------------- %s""" % (this_report, exception.message) # Convert this to a JoblibException exception_type = _mk_exception(exception.etype)[0] exception = exception_type(report) # Kill remaining running processes without waiting for # the results as we will raise the exception we got back # to the caller instead of returning any result. self._terminate_pool() if self._managed_pool: # In case we had to terminate a managed pool, let # us start a new one to ensure that subsequent calls # to __call__ on the same Parallel instance will get # a working pool as they expect. self._initialize_pool() raise exception def __call__(self, iterable): if self._jobs: raise ValueError('This Parallel instance is already running') # A flag used to abort the dispatching of jobs in case an # exception is found self._aborting = False if not self._managed_pool: n_jobs = self._initialize_pool() else: n_jobs = self._effective_n_jobs() if self.batch_size == 'auto': self._effective_batch_size = 1 iterator = iter(iterable) pre_dispatch = self.pre_dispatch if pre_dispatch == 'all' or n_jobs == 1: # prevent further dispatch via multiprocessing callback thread self._original_iterator = None self._pre_dispatch_amount = 0 else: self._original_iterator = iterator if hasattr(pre_dispatch, 'endswith'): pre_dispatch = eval(pre_dispatch) self._pre_dispatch_amount = pre_dispatch = int(pre_dispatch) # The main thread will consume the first pre_dispatch items and # the remaining items will later be lazily dispatched by async # callbacks upon task completions. iterator = itertools.islice(iterator, pre_dispatch) self._start_time = time.time() self.n_dispatched_batches = 0 self.n_dispatched_tasks = 0 self.n_completed_tasks = 0 self._smoothed_batch_duration = 0.0 try: # Only set self._iterating to True if at least a batch # was dispatched. In particular this covers the edge # case of Parallel used with an exhausted iterator. while self.dispatch_one_batch(iterator): self._iterating = True else: self._iterating = False if pre_dispatch == "all" or n_jobs == 1: # The iterable was consumed all at once by the above for loop. # No need to wait for async callbacks to trigger to # consumption. self._iterating = False self.retrieve() # Make sure that we get a last message telling us we are done elapsed_time = time.time() - self._start_time self._print('Done %3i out of %3i | elapsed: %s finished', (len(self._output), len(self._output), short_format_time(elapsed_time))) finally: if not self._managed_pool: self._terminate_pool() self._jobs = list() output = self._output self._output = None return output def __repr__(self): return '%s(n_jobs=%s)' % (self.__class__.__name__, self.n_jobs) joblib-0.9.4/joblib/pool.py000066400000000000000000000557631264716474700156130ustar00rootroot00000000000000"""Custom implementation of multiprocessing.Pool with custom pickler This module provides efficient ways of working with data stored in shared memory with numpy.memmap arrays without inducing any memory copy between the parent and child processes. This module should not be imported if multiprocessing is not available as it implements subclasses of multiprocessing Pool that uses a custom alternative to SimpleQueue. 
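
A minimal usage sketch (array size and worker function are illustrative only;
on platforms that spawn rather than fork, pool creation should additionally be
guarded by ``if __name__ == '__main__'``)::

    >>> import numpy as np                                    #doctest: +SKIP
    >>> big = np.zeros(int(2e6))                              #doctest: +SKIP
    >>> pool = MemmapingPool(2, max_nbytes=1e6)               #doctest: +SKIP
    >>> try:                                                  #doctest: +SKIP
    ...     results = pool.map(np.sum, [big, big])
    ... finally:
    ...     pool.terminate()  # also removes the temporary memmap folder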
""" # Author: Olivier Grisel # Copyright: 2012, Olivier Grisel # License: BSD 3 clause from mmap import mmap import errno import os import stat import sys import threading import atexit import tempfile import shutil import warnings from time import sleep try: WindowsError except NameError: WindowsError = None try: # Python 2 compat from cPickle import loads from cPickle import dumps except ImportError: from pickle import loads from pickle import dumps import copyreg # Customizable pure Python pickler in Python 2 # customizable C-optimized pickler under Python 3.3+ from pickle import Pickler from pickle import HIGHEST_PROTOCOL from io import BytesIO from ._multiprocessing_helpers import mp, assert_spawning # We need the class definition to derive from it not the multiprocessing.Pool # factory function from multiprocessing.pool import Pool try: import numpy as np from numpy.lib.stride_tricks import as_strided except ImportError: np = None from .numpy_pickle import load from .numpy_pickle import dump from .hashing import hash # Some system have a ramdisk mounted by default, we can use it instead of /tmp # as the default folder to dump big arrays to share with subprocesses SYSTEM_SHARED_MEM_FS = '/dev/shm' # Folder and file permissions to chmod temporary files generated by the # memmaping pool. Only the owner of the Python process can access the # temporary files and folder. FOLDER_PERMISSIONS = stat.S_IRUSR | stat.S_IWUSR | stat.S_IXUSR FILE_PERMISSIONS = stat.S_IRUSR | stat.S_IWUSR ############################################################################### # Support for efficient transient pickling of numpy data structures def _get_backing_memmap(a): """Recursively look up the original np.memmap instance base if any""" b = getattr(a, 'base', None) if b is None: # TODO: check scipy sparse datastructure if scipy is installed # a nor its descendants do not have a memmap base return None elif isinstance(b, mmap): # a is already a real memmap instance. return a else: # Recursive exploration of the base ancestry return _get_backing_memmap(b) def has_shareable_memory(a): """Return True if a is backed by some mmap buffer directly or not""" return _get_backing_memmap(a) is not None def _strided_from_memmap(filename, dtype, mode, offset, order, shape, strides, total_buffer_len): """Reconstruct an array view on a memmory mapped file""" if mode == 'w+': # Do not zero the original data when unpickling mode = 'r+' if strides is None: # Simple, contiguous memmap return np.memmap(filename, dtype=dtype, shape=shape, mode=mode, offset=offset, order=order) else: # For non-contiguous data, memmap the total enclosing buffer and then # extract the non-contiguous view with the stride-tricks API base = np.memmap(filename, dtype=dtype, shape=total_buffer_len, mode=mode, offset=offset, order=order) return as_strided(base, shape=shape, strides=strides) def _reduce_memmap_backed(a, m): """Pickling reduction for memmap backed arrays a is expected to be an instance of np.ndarray (or np.memmap) m is expected to be an instance of np.memmap on the top of the ``base`` attribute ancestry of a. ``m.base`` should be the real python mmap object. 
""" # offset that comes from the striding differences between a and m a_start, a_end = np.byte_bounds(a) m_start = np.byte_bounds(m)[0] offset = a_start - m_start # offset from the backing memmap offset += m.offset if m.flags['F_CONTIGUOUS']: order = 'F' else: # The backing memmap buffer is necessarily contiguous hence C if not # Fortran order = 'C' if a.flags['F_CONTIGUOUS'] or a.flags['C_CONTIGUOUS']: # If the array is a contiguous view, no need to pass the strides strides = None total_buffer_len = None else: # Compute the total number of items to map from which the strided # view will be extracted. strides = a.strides total_buffer_len = (a_end - a_start) // a.itemsize return (_strided_from_memmap, (m.filename, a.dtype, m.mode, offset, order, a.shape, strides, total_buffer_len)) def reduce_memmap(a): """Pickle the descriptors of a memmap instance to reopen on same file""" m = _get_backing_memmap(a) if m is not None: # m is a real mmap backed memmap instance, reduce a preserving striding # information return _reduce_memmap_backed(a, m) else: # This memmap instance is actually backed by a regular in-memory # buffer: this can happen when using binary operators on numpy.memmap # instances return (loads, (dumps(np.asarray(a), protocol=HIGHEST_PROTOCOL),)) class ArrayMemmapReducer(object): """Reducer callable to dump large arrays to memmap files. Parameters ---------- max_nbytes: int Threshold to trigger memmaping of large arrays to files created a folder. temp_folder: str Path of a folder where files for backing memmaped arrays are created. mmap_mode: 'r', 'r+' or 'c' Mode for the created memmap datastructure. See the documentation of numpy.memmap for more details. Note: 'w+' is coerced to 'r+' automatically to avoid zeroing the data on unpickling. verbose: int, optional, 0 by default If verbose > 0, memmap creations are logged. If verbose > 1, both memmap creations, reuse and array pickling are logged. prewarm: bool, optional, False by default. Force a read on newly memmaped array to make sure that OS pre-cache it memory. This can be useful to avoid concurrent disk access when the same data array is passed to different worker processes. """ def __init__(self, max_nbytes, temp_folder, mmap_mode, verbose=0, context_id=None, prewarm=True): self._max_nbytes = max_nbytes self._temp_folder = temp_folder self._mmap_mode = mmap_mode self.verbose = int(verbose) self._prewarm = prewarm if context_id is not None: warnings.warn('context_id is deprecated and ignored in joblib' ' 0.9.4 and will be removed in 0.11', DeprecationWarning) def __call__(self, a): m = _get_backing_memmap(a) if m is not None: # a is already backed by a memmap file, let's reuse it directly return _reduce_memmap_backed(a, m) if (not a.dtype.hasobject and self._max_nbytes is not None and a.nbytes > self._max_nbytes): # check that the folder exists (lazily create the pool temp folder # if required) try: os.makedirs(self._temp_folder) os.chmod(self._temp_folder, FOLDER_PERMISSIONS) except OSError as e: if e.errno != errno.EEXIST: raise e # Find a unique, concurrent safe filename for writing the # content of this array only once. 
basename = "%d-%d-%s.pkl" % ( os.getpid(), id(threading.current_thread()), hash(a)) filename = os.path.join(self._temp_folder, basename) # In case the same array with the same content is passed several # times to the pool subprocess children, serialize it only once # XXX: implement an explicit reference counting scheme to make it # possible to delete temporary files as soon as the workers are # done processing this data. if not os.path.exists(filename): if self.verbose > 0: print("Memmaping (shape=%r, dtype=%s) to new file %s" % ( a.shape, a.dtype, filename)) for dumped_filename in dump(a, filename): os.chmod(dumped_filename, FILE_PERMISSIONS) if self._prewarm: # Warm up the data to avoid concurrent disk access in # multiple children processes load(filename, mmap_mode=self._mmap_mode).max() elif self.verbose > 1: print("Memmaping (shape=%s, dtype=%s) to old file %s" % ( a.shape, a.dtype, filename)) # The worker process will use joblib.load to memmap the data return (load, (filename, self._mmap_mode)) else: # do not convert a into memmap, let pickler do its usual copy with # the default system pickler if self.verbose > 1: print("Pickling array (shape=%r, dtype=%s)." % ( a.shape, a.dtype)) return (loads, (dumps(a, protocol=HIGHEST_PROTOCOL),)) ############################################################################### # Enable custom pickling in Pool queues class CustomizablePickler(Pickler): """Pickler that accepts custom reducers. HIGHEST_PROTOCOL is selected by default as this pickler is used to pickle ephemeral datastructures for interprocess communication hence no backward compatibility is required. `reducers` is expected expected to be a dictionary with key/values being `(type, callable)` pairs where `callable` is a function that give an instance of `type` will return a tuple `(constructor, tuple_of_objects)` to rebuild an instance out of the pickled `tuple_of_objects` as would return a `__reduce__` method. See the standard library documentation on pickling for more details. """ # We override the pure Python pickler as its the only way to be able to # customize the dispatch table without side effects in Python 2.6 # to 3.2. For Python 3.3+ leverage the new dispatch_table # feature from http://bugs.python.org/issue14166 that makes it possible # to use the C implementation of the Pickler which is faster. def __init__(self, writer, reducers=None, protocol=HIGHEST_PROTOCOL): Pickler.__init__(self, writer, protocol=protocol) if reducers is None: reducers = {} if hasattr(Pickler, 'dispatch'): # Make the dispatch registry an instance level attribute instead of # a reference to the class dictionary under Python 2 self.dispatch = Pickler.dispatch.copy() else: # Under Python 3 initialize the dispatch table with a copy of the # default registry self.dispatch_table = copyreg.dispatch_table.copy() for type, reduce_func in reducers.items(): self.register(type, reduce_func) def register(self, type, reduce_func): if hasattr(Pickler, 'dispatch'): # Python 2 pickler dispatching is not explicitly customizable. # Let us use a closure to workaround this limitation. def dispatcher(self, obj): reduced = reduce_func(obj) self.save_reduce(obj=obj, *reduced) self.dispatch[type] = dispatcher else: self.dispatch_table[type] = reduce_func class CustomizablePicklingQueue(object): """Locked Pipe implementation that uses a customizable pickler. 
This class is an alternative to the multiprocessing implementation of SimpleQueue in order to make it possible to pass custom pickling reducers, for instance to avoid memory copy when passing memmory mapped datastructures. `reducers` is expected expected to be a dictionary with key/values being `(type, callable)` pairs where `callable` is a function that give an instance of `type` will return a tuple `(constructor, tuple_of_objects)` to rebuild an instance out of the pickled `tuple_of_objects` as would return a `__reduce__` method. See the standard library documentation on pickling for more details. """ def __init__(self, context, reducers=None): self._reducers = reducers self._reader, self._writer = context.Pipe(duplex=False) self._rlock = context.Lock() if sys.platform == 'win32': self._wlock = None else: self._wlock = context.Lock() self._make_methods() def __getstate__(self): assert_spawning(self) return (self._reader, self._writer, self._rlock, self._wlock, self._reducers) def __setstate__(self, state): (self._reader, self._writer, self._rlock, self._wlock, self._reducers) = state self._make_methods() def empty(self): return not self._reader.poll() def _make_methods(self): self._recv = recv = self._reader.recv racquire, rrelease = self._rlock.acquire, self._rlock.release def get(): racquire() try: return recv() finally: rrelease() self.get = get if self._reducers: def send(obj): buffer = BytesIO() CustomizablePickler(buffer, self._reducers).dump(obj) self._writer.send_bytes(buffer.getvalue()) self._send = send else: self._send = send = self._writer.send if self._wlock is None: # writes to a message oriented win32 pipe are atomic self.put = send else: wlock_acquire, wlock_release = ( self._wlock.acquire, self._wlock.release) def put(obj): wlock_acquire() try: return send(obj) finally: wlock_release() self.put = put class PicklingPool(Pool): """Pool implementation with customizable pickling reducers. This is useful to control how data is shipped between processes and makes it possible to use shared memory without useless copies induces by the default pickling methods of the original objects passed as arguments to dispatch. `forward_reducers` and `backward_reducers` are expected to be dictionaries with key/values being `(type, callable)` pairs where `callable` is a function that give an instance of `type` will return a tuple `(constructor, tuple_of_objects)` to rebuild an instance out of the pickled `tuple_of_objects` as would return a `__reduce__` method. See the standard library documentation on pickling for more details. 
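
    For instance, instances of a hypothetical ``Point`` class could be shipped
    to the workers through a custom reducer (illustrative sketch only)::

        >>> class Point(object):                              #doctest: +SKIP
        ...     def __init__(self, x, y):
        ...         self.x, self.y = x, y
        >>> def reduce_point(p):                              #doctest: +SKIP
        ...     return Point, (p.x, p.y)
        >>> pool = PicklingPool(processes=2,
        ...                     forward_reducers={Point: reduce_point})  #doctest: +SKIP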
""" def __init__(self, processes=None, forward_reducers=None, backward_reducers=None, **kwargs): if forward_reducers is None: forward_reducers = dict() if backward_reducers is None: backward_reducers = dict() self._forward_reducers = forward_reducers self._backward_reducers = backward_reducers poolargs = dict(processes=processes) poolargs.update(kwargs) super(PicklingPool, self).__init__(**poolargs) def _setup_queues(self): context = getattr(self, '_ctx', mp) self._inqueue = CustomizablePicklingQueue(context, self._forward_reducers) self._outqueue = CustomizablePicklingQueue(context, self._backward_reducers) self._quick_put = self._inqueue._send self._quick_get = self._outqueue._recv def delete_folder(folder_path): """Utility function to cleanup a temporary folder if still existing""" try: if os.path.exists(folder_path): shutil.rmtree(folder_path) except WindowsError: warnings.warn("Failed to clean temporary folder: %s" % folder_path) class MemmapingPool(PicklingPool): """Process pool that shares large arrays to avoid memory copy. This drop-in replacement for `multiprocessing.pool.Pool` makes it possible to work efficiently with shared memory in a numpy context. Existing instances of numpy.memmap are preserved: the child suprocesses will have access to the same shared memory in the original mode except for the 'w+' mode that is automatically transformed as 'r+' to avoid zeroing the original data upon instantiation. Furthermore large arrays from the parent process are automatically dumped to a temporary folder on the filesystem such as child processes to access their content via memmaping (file system backed shared memory). Note: it is important to call the terminate method to collect the temporary folder used by the pool. Parameters ---------- processes: int, optional Number of worker processes running concurrently in the pool. initializer: callable, optional Callable executed on worker process creation. initargs: tuple, optional Arguments passed to the initializer callable. temp_folder: str, optional Folder to be used by the pool for memmaping large arrays for sharing memory with worker processes. If None, this will try in order: - a folder pointed by the JOBLIB_TEMP_FOLDER environment variable, - /dev/shm if the folder exists and is writable: this is a RAMdisk filesystem available by default on modern Linux distributions, - the default system temporary folder that can be overridden with TMP, TMPDIR or TEMP environment variables, typically /tmp under Unix operating systems. max_nbytes int or None, optional, 1e6 by default Threshold on the size of arrays passed to the workers that triggers automated memmory mapping in temp_folder. Use None to disable memmaping of large arrays. forward_reducers: dictionary, optional Reducers used to pickle objects passed from master to worker processes: see below. backward_reducers: dictionary, optional Reducers used to pickle return values from workers back to the master process. verbose: int, optional Make it possible to monitor how the communication of numpy arrays with the subprocess is handled (pickling or memmaping) prewarm: bool or str, optional, "auto" by default. If True, force a read on newly memmaped array to make sure that OS pre- cache it in memory. This can be useful to avoid concurrent disk access when the same data array is passed to different worker processes. If "auto" (by default), prewarm is set to True, unless the Linux shared memory partition /dev/shm is available and used as temp_folder. 
`forward_reducers` and `backward_reducers` are expected to be dictionaries with key/values being `(type, callable)` pairs where `callable` is a function that give an instance of `type` will return a tuple `(constructor, tuple_of_objects)` to rebuild an instance out of the pickled `tuple_of_objects` as would return a `__reduce__` method. See the standard library documentation on pickling for more details. """ def __init__(self, processes=None, temp_folder=None, max_nbytes=1e6, mmap_mode='r', forward_reducers=None, backward_reducers=None, verbose=0, context_id=None, prewarm=False, **kwargs): if forward_reducers is None: forward_reducers = dict() if backward_reducers is None: backward_reducers = dict() if context_id is not None: warnings.warn('context_id is deprecated and ignored in joblib' ' 0.9.4 and will be removed in 0.11', DeprecationWarning) # Prepare a sub-folder name for the serialization of this particular # pool instance (do not create in advance to spare FS write access if # no array is to be dumped): use_shared_mem = False pool_folder_name = "joblib_memmaping_pool_%d_%d" % ( os.getpid(), id(self)) if temp_folder is None: temp_folder = os.environ.get('JOBLIB_TEMP_FOLDER', None) if temp_folder is None: if os.path.exists(SYSTEM_SHARED_MEM_FS): try: temp_folder = SYSTEM_SHARED_MEM_FS pool_folder = os.path.join(temp_folder, pool_folder_name) if not os.path.exists(pool_folder): os.makedirs(pool_folder) use_shared_mem = True except IOError: # Missing rights in the the /dev/shm partition, # fallback to regular temp folder. temp_folder = None if temp_folder is None: # Fallback to the default tmp folder, typically /tmp temp_folder = tempfile.gettempdir() temp_folder = os.path.abspath(os.path.expanduser(temp_folder)) pool_folder = os.path.join(temp_folder, pool_folder_name) self._temp_folder = pool_folder # Register the garbage collector at program exit in case caller forgets # to call terminate explicitly: note we do not pass any reference to # self to ensure that this callback won't prevent garbage collection of # the pool instance and related file handler resources such as POSIX # semaphores and pipes atexit.register(lambda: delete_folder(pool_folder)) if np is not None: # Register smart numpy.ndarray reducers that detects memmap backed # arrays and that is alse able to dump to memmap large in-memory # arrays over the max_nbytes threshold if prewarm == "auto": prewarm = not use_shared_mem forward_reduce_ndarray = ArrayMemmapReducer( max_nbytes, pool_folder, mmap_mode, verbose, prewarm=prewarm) forward_reducers[np.ndarray] = forward_reduce_ndarray forward_reducers[np.memmap] = reduce_memmap # Communication from child process to the parent process always # pickles in-memory numpy.ndarray without dumping them as memmap # to avoid confusing the caller and make it tricky to collect the # temporary folder backward_reduce_ndarray = ArrayMemmapReducer( None, pool_folder, mmap_mode, verbose) backward_reducers[np.ndarray] = backward_reduce_ndarray backward_reducers[np.memmap] = reduce_memmap poolargs = dict( processes=processes, forward_reducers=forward_reducers, backward_reducers=backward_reducers) poolargs.update(kwargs) super(MemmapingPool, self).__init__(**poolargs) def terminate(self): super(MemmapingPool, self).terminate() delete_folder(self._temp_folder) joblib-0.9.4/joblib/test/000077500000000000000000000000001264716474700152275ustar00rootroot00000000000000joblib-0.9.4/joblib/test/__init__.py000066400000000000000000000001111264716474700173310ustar00rootroot00000000000000from joblib.test 
import test_memory from joblib.test import test_hashing joblib-0.9.4/joblib/test/common.py000066400000000000000000000044721264716474700171000ustar00rootroot00000000000000""" Small utilities for testing. """ import threading import signal import nose import time import os import sys from joblib._multiprocessing_helpers import mp from nose import SkipTest from nose.tools import with_setup # A decorator to run tests only when numpy is available try: import numpy as np def with_numpy(func): """ A decorator to skip tests requiring numpy. """ return func except ImportError: def with_numpy(func): """ A decorator to skip tests requiring numpy. """ def my_func(): raise nose.SkipTest('Test requires numpy') return my_func np = None # A utility to kill the test runner in case a multiprocessing assumption # triggers an infinite wait on a pipe by the master process for one of its # failed workers _KILLER_THREADS = dict() def setup_autokill(module_name, timeout=30): """Timeout based suiciding thread to kill the test runner process If some subprocess dies in an unexpected way we don't want the parent process to block indefinitely. """ if "NO_AUTOKILL" in os.environ or "--pdb" in sys.argv: # Do not install the autokiller return # Renew any previous contract under that name by first cancelling the # previous version (that should normally not happen in practice) teardown_autokill(module_name) def autokill(): pid = os.getpid() print("Timeout exceeded: terminating stalled process: %d" % pid) os.kill(pid, signal.SIGTERM) # If were are still there ask the OS to kill ourself for real time.sleep(0.5) print("Timeout exceeded: killing stalled process: %d" % pid) os.kill(pid, signal.SIGKILL) _KILLER_THREADS[module_name] = t = threading.Timer(timeout, autokill) t.start() def teardown_autokill(module_name): """Cancel a previously started killer thread""" killer = _KILLER_THREADS.get(module_name) if killer is not None: killer.cancel() def check_multiprocessing(): if mp is None: raise SkipTest('Need multiprocessing to run') with_multiprocessing = with_setup(check_multiprocessing) def setup_if_has_dev_shm(): if not os.path.exists('/dev/shm'): raise SkipTest("This test requires the /dev/shm shared memory fs.") with_dev_shm = with_setup(setup_if_has_dev_shm) joblib-0.9.4/joblib/test/data/000077500000000000000000000000001264716474700161405ustar00rootroot00000000000000joblib-0.9.4/joblib/test/data/__init__.py000066400000000000000000000000001264716474700202370ustar00rootroot00000000000000joblib-0.9.4/joblib/test/data/create_numpy_pickle.py000066400000000000000000000041151264716474700225350ustar00rootroot00000000000000""" This script is used to generate test data for joblib/test/test_numpy_pickle.py """ import sys import re # nosetests needs to be able to import this module even when numpy is # not installed try: import numpy as np except ImportError: np = None import joblib def get_joblib_version(joblib_version=joblib.__version__): """Normalise joblib version by removing suffix >>> get_joblib_version('0.8.4') '0.8.4' >>> get_joblib_version('0.8.4b1') '0.8.4' >>> get_joblib_version('0.9.dev0') '0.9' """ matches = [re.match(r'(\d+).*', each) for each in joblib_version.split('.')] return '.'.join([m.group(1) for m in matches if m is not None]) def write_test_pickle(to_pickle): joblib_version = get_joblib_version() py_version = '{0[0]}{0[1]}'.format(sys.version_info) numpy_version = ''.join(np.__version__.split('.')[:2]) print('file:', np.__file__) pickle_filename = 'joblib_{0}_compressed_pickle_py{1}_np{2}.gz'.format( 
joblib_version, py_version, numpy_version) joblib.dump(to_pickle, pickle_filename, compress=True) pickle_filename = 'joblib_{0}_pickle_py{1}_np{2}.pkl'.format( joblib_version, py_version, numpy_version) joblib.dump(to_pickle, pickle_filename, compress=False) if __name__ == '__main__': # We need to be specific about dtypes, in particular endianness, # because the pickles can be generated on one architecture and # the tests run on another one. See # https://github.com/joblib/joblib/issues/279. to_pickle = [np.arange(5, dtype=np.dtype(' [the remainder of the `to_pickle` list and the binary test fixtures under joblib-0.9.4/joblib/test/data/ (joblib 0.9.2 pickles for py26/py27/py33/py34/py35 and np16-np19, their *_0N.npy companion files and compressed .gz variants) are binary payloads that do not survive as text and are omitted here] joblib-0.9.4/joblib/test/test_disk.py """ Unit tests for the disk utilities. """ # Authors: Gael Varoquaux # Lars Buitinck # Copyright (c) 2010 Gael Varoquaux # License: BSD Style, 3 clauses. from __future__ import with_statement import os import shutil import array from tempfile import mkdtemp import nose from joblib.disk import disk_used, memstr_to_kbytes, mkdirp ############################################################################### def test_disk_used(): cachedir = mkdtemp() try: if os.path.exists(cachedir): shutil.rmtree(cachedir) os.mkdir(cachedir) # Now write a file that is 1M big in this directory, and check the # size. The reason we use such a big file is that it makes us robust # to errors due to block allocation.
a = array.array('i') sizeof_i = a.itemsize target_size = 1024 n = int(target_size * 1024 / sizeof_i) a = array.array('i', n * (1,)) with open(os.path.join(cachedir, 'test'), 'wb') as output: a.tofile(output) nose.tools.assert_true(disk_used(cachedir) >= target_size) nose.tools.assert_true(disk_used(cachedir) < target_size + 12) finally: shutil.rmtree(cachedir) def test_memstr_to_kbytes(): for text, value in zip(('80G', '1.4M', '120M', '53K'), (80 * 1024 ** 2, int(1.4 * 1024), 120 * 1024, 53)): yield nose.tools.assert_equal, memstr_to_kbytes(text), value nose.tools.assert_raises(ValueError, memstr_to_kbytes, 'foobar') def test_mkdirp(): try: tmp = mkdtemp() mkdirp(os.path.join(tmp, "ham")) mkdirp(os.path.join(tmp, "ham")) mkdirp(os.path.join(tmp, "spam", "spam")) # Not all OSErrors are ignored nose.tools.assert_raises(OSError, mkdirp, "") finally: shutil.rmtree(tmp) joblib-0.9.4/joblib/test/test_format_stack.py000066400000000000000000000033571264716474700213250ustar00rootroot00000000000000""" Unit tests for the stack formatting utilities """ # Author: Gael Varoquaux # Copyright (c) 2010 Gael Varoquaux # License: BSD Style, 3 clauses. import nose import sys from joblib.format_stack import safe_repr, _fixed_getframes, format_records ############################################################################### class Vicious(object): def __repr__(self): raise ValueError def test_safe_repr(): safe_repr(Vicious()) def _change_file_extensions_to_pyc(record): _1, filename, _2, _3, _4, _5 = record if filename.endswith('.py'): filename += 'c' return _1, filename, _2, _3, _4, _5 def _raise_exception(a, b): """Function that raises with a non trivial call stack """ def helper(a, b): raise ValueError('Nope, this can not work') helper(a, b) def test_format_records(): try: _raise_exception('a', 42) except ValueError: etb = sys.exc_info()[2] records = _fixed_getframes(etb) # Modify filenames in traceback records from .py to .pyc pyc_records = [_change_file_extensions_to_pyc(record) for record in records] formatted_records = format_records(pyc_records) # Check that the .py file and not the .pyc one is listed in # the traceback for fmt_rec in formatted_records: assert 'test_format_stack.py in' in fmt_rec # Check exception stack assert "_raise_exception('a', 42)" in formatted_records[0] assert 'helper(a, b)' in formatted_records[1] assert "a = 'a'" in formatted_records[1] assert 'b = 42' in formatted_records[1] assert 'Nope, this can not work' in formatted_records[2] joblib-0.9.4/joblib/test/test_func_inspect.py000066400000000000000000000204331264716474700213220ustar00rootroot00000000000000""" Test the func_inspect module. """ # Author: Gael Varoquaux # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. import os import shutil import nose import tempfile import functools import sys from joblib.func_inspect import filter_args, get_func_name, get_func_code from joblib.func_inspect import _clean_win_chars, format_signature from joblib.memory import Memory from joblib.test.common import with_numpy from joblib.testing import assert_raises_regex from joblib._compat import PY3_OR_LATER ############################################################################### # Module-level functions, for tests def f(x, y=0): pass def g(x, y=1): """ A module-level function for testing purposes. """ return x ** 2 + y def f2(x): pass # Create a Memory object to test decorated functions. # We should be careful not to call the decorated functions, so that # cache directories are not created in the temp dir. 
temp_folder = tempfile.mkdtemp(prefix="joblib_test_func_inspect_") mem = Memory(cachedir=temp_folder) def teardown_module(): if os.path.exists(temp_folder): try: shutil.rmtree(temp_folder) except Exception as e: print("Failed to delete temporary folder %s: %r" % (temp_folder, e)) @mem.cache def g(x): return x def h(x, y=0, *args, **kwargs): pass def i(x=1): pass def j(x, y, **kwargs): pass def k(*args, **kwargs): pass class Klass(object): def f(self, x): return x ############################################################################### # Tests def test_filter_args(): yield nose.tools.assert_equal, filter_args(f, [], (1, )),\ {'x': 1, 'y': 0} yield nose.tools.assert_equal, filter_args(f, ['x'], (1, )),\ {'y': 0} yield nose.tools.assert_equal, filter_args(f, ['y'], (0, )),\ {'x': 0} yield nose.tools.assert_equal, filter_args(f, ['y'], (0, ), dict(y=1)), {'x': 0} yield nose.tools.assert_equal, filter_args(f, ['x', 'y'], (0, )), {} yield nose.tools.assert_equal, filter_args(f, [], (0,), dict(y=1)), {'x': 0, 'y': 1} yield nose.tools.assert_equal, filter_args(f, ['y'], (), dict(x=2, y=1)), {'x': 2} yield nose.tools.assert_equal, filter_args(i, [], (2, )), {'x': 2} yield nose.tools.assert_equal, filter_args(f2, [], (), dict(x=1)), {'x': 1} def test_filter_args_method(): obj = Klass() nose.tools.assert_equal(filter_args(obj.f, [], (1, )), {'x': 1, 'self': obj}) def test_filter_varargs(): yield nose.tools.assert_equal, filter_args(h, [], (1, )), \ {'x': 1, 'y': 0, '*': [], '**': {}} yield nose.tools.assert_equal, filter_args(h, [], (1, 2, 3, 4)), \ {'x': 1, 'y': 2, '*': [3, 4], '**': {}} yield nose.tools.assert_equal, filter_args(h, [], (1, 25), dict(ee=2)), \ {'x': 1, 'y': 25, '*': [], '**': {'ee': 2}} yield nose.tools.assert_equal, filter_args(h, ['*'], (1, 2, 25), dict(ee=2)), \ {'x': 1, 'y': 2, '**': {'ee': 2}} def test_filter_kwargs(): nose.tools.assert_equal(filter_args(k, [], (1, 2), dict(ee=2)), {'*': [1, 2], '**': {'ee': 2}}) nose.tools.assert_equal(filter_args(k, [], (3, 4)), {'*': [3, 4], '**': {}}) def test_filter_args_2(): nose.tools.assert_equal(filter_args(j, [], (1, 2), dict(ee=2)), {'x': 1, 'y': 2, '**': {'ee': 2}}) nose.tools.assert_raises(ValueError, filter_args, f, 'a', (None, )) # Check that we capture an undefined argument nose.tools.assert_raises(ValueError, filter_args, f, ['a'], (None, )) ff = functools.partial(f, 1) # filter_args has to special-case partial nose.tools.assert_equal(filter_args(ff, [], (1, )), {'*': [1], '**': {}}) nose.tools.assert_equal(filter_args(ff, ['y'], (1, )), {'*': [1], '**': {}}) def test_func_name(): yield nose.tools.assert_equal, 'f', get_func_name(f)[1] # Check that we are not confused by the decoration yield nose.tools.assert_equal, 'g', get_func_name(g)[1] def test_func_inspect_errors(): # Check that func_inspect is robust and will work on weird objects nose.tools.assert_equal(get_func_name('a'.lower)[-1], 'lower') nose.tools.assert_equal(get_func_code('a'.lower)[1:], (None, -1)) ff = lambda x: x nose.tools.assert_equal(get_func_name(ff, win_characters=False)[-1], '') nose.tools.assert_equal(get_func_code(ff)[1], __file__.replace('.pyc', '.py')) # Simulate a function defined in __main__ ff.__module__ = '__main__' nose.tools.assert_equal(get_func_name(ff, win_characters=False)[-1], '') nose.tools.assert_equal(get_func_code(ff)[1], __file__.replace('.pyc', '.py')) if PY3_OR_LATER: exec(""" def func_with_kwonly_args(a, b, *, kw1='kw1', kw2='kw2'): pass def func_with_signature(a: int, b: int) -> None: pass """) def 
test_filter_args_python_3(): nose.tools.assert_equal( filter_args(func_with_kwonly_args, [], (1, 2), {'kw1': 3, 'kw2': 4}), {'a': 1, 'b': 2, 'kw1': 3, 'kw2': 4}) # filter_args doesn't care about keyword-only arguments so you # can pass 'kw1' into *args without any problem assert_raises_regex( ValueError, "Keyword-only parameter 'kw1' was passed as positional parameter", filter_args, func_with_kwonly_args, [], (1, 2, 3), {'kw2': 2}) nose.tools.assert_equal( filter_args(func_with_kwonly_args, ['b', 'kw2'], (1, 2), {'kw1': 3, 'kw2': 4}), {'a': 1, 'kw1': 3}) nose.tools.assert_equal( filter_args(func_with_signature, ['b'], (1, 2)), {'a': 1}) def test_bound_methods(): """ Make sure that calling the same method on two different instances of the same class does resolv to different signatures. """ a = Klass() b = Klass() nose.tools.assert_not_equal(filter_args(a.f, [], (1, )), filter_args(b.f, [], (1, ))) def test_filter_args_error_msg(): """ Make sure that filter_args returns decent error messages, for the sake of the user. """ nose.tools.assert_raises(ValueError, filter_args, f, []) def test_clean_win_chars(): string = r'C:\foo\bar\main.py' mangled_string = _clean_win_chars(string) for char in ('\\', ':', '<', '>', '!'): nose.tools.assert_false(char in mangled_string) def test_format_signature(): # Test signature formatting. path, sgn = format_signature(g, list(range(10))) nose.tools.assert_equal(sgn, 'g([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])') path, sgn = format_signature(g, list(range(10)), y=list(range(10))) nose.tools.assert_equal(sgn, 'g([0, 1, 2, 3, 4, 5, 6, 7, 8, 9],' ' y=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9])') @with_numpy def test_format_signature_numpy(): """ Test the format signature formatting with numpy. """ def test_special_source_encoding(): from joblib.test.test_func_inspect_special_encoding import big5_f func_code, source_file, first_line = get_func_code(big5_f) nose.tools.assert_equal(first_line, 5) nose.tools.assert_true("def big5_f():" in func_code) nose.tools.assert_true("test_func_inspect_special_encoding" in source_file) def _get_code(): from joblib.test.test_func_inspect_special_encoding import big5_f return get_func_code(big5_f)[0] def test_func_code_consistency(): from joblib.parallel import Parallel, delayed codes = Parallel(n_jobs=2)(delayed(_get_code)() for _ in range(5)) nose.tools.assert_equal(len(set(codes)), 1) joblib-0.9.4/joblib/test/test_func_inspect_special_encoding.py000066400000000000000000000002221264716474700246620ustar00rootroot00000000000000# -*- coding: big5 -*- # Some Traditional Chinese characters: @Ǥr def big5_f(): """Ωժ """ # return 0 joblib-0.9.4/joblib/test/test_hashing.py000066400000000000000000000343141264716474700202660ustar00rootroot00000000000000""" Test the hashing module. """ # Author: Gael Varoquaux # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. 
import nose import time import hashlib import tempfile import os import sys import gc import io import collections import pickle import random from nose.tools import assert_equal from joblib.hashing import hash from joblib.func_inspect import filter_args from joblib.memory import Memory from joblib.testing import assert_raises_regex from joblib.test.test_memory import env as test_memory_env from joblib.test.test_memory import setup_module as test_memory_setup_func from joblib.test.test_memory import teardown_module as test_memory_teardown_func from joblib.test.common import np, with_numpy from joblib.my_exceptions import TransportableException from joblib._compat import PY3_OR_LATER try: # Python 2/Python 3 compat unicode('str') except NameError: unicode = lambda s: s def assert_less(a, b): if a > b: raise AssertionError("%r is not lower than %r") ############################################################################### # Helper functions for the tests def time_func(func, *args): """ Time function func on *args. """ times = list() for _ in range(3): t1 = time.time() func(*args) times.append(time.time() - t1) return min(times) def relative_time(func1, func2, *args): """ Return the relative time between func1 and func2 applied on *args. """ time_func1 = time_func(func1, *args) time_func2 = time_func(func2, *args) relative_diff = 0.5 * (abs(time_func1 - time_func2) / (time_func1 + time_func2)) return relative_diff class Klass(object): def f(self, x): return x class KlassWithCachedMethod(object): def __init__(self): mem = Memory(cachedir=test_memory_env['dir']) self.f = mem.cache(self.f) def f(self, x): return x ############################################################################### # Tests def test_trival_hash(): """ Smoke test hash on various types. """ obj_list = [1, 2, 1., 2., 1 + 1j, 2. + 1j, 'a', 'b', (1, ), (1, 1, ), [1, ], [1, 1, ], {1: 1}, {1: 2}, {2: 1}, None, gc.collect, [1, ].append, # Next 2 sets have unorderable elements in python 3. set(('a', 1)), set(('a', 1, ('a', 1))), # Next 2 dicts have unorderable type of keys in python 3. {'a': 1, 1: 2}, {'a': 1, 1: 2, 'd': {'a': 1}}, ] for obj1 in obj_list: for obj2 in obj_list: # Check that 2 objects have the same hash only if they are # the same. yield nose.tools.assert_equal, hash(obj1) == hash(obj2), \ obj1 is obj2 def test_hash_methods(): # Check that hashing instance methods works a = io.StringIO(unicode('a')) nose.tools.assert_equal(hash(a.flush), hash(a.flush)) a1 = collections.deque(range(10)) a2 = collections.deque(range(9)) nose.tools.assert_not_equal(hash(a1.extend), hash(a2.extend)) @with_numpy def test_hash_numpy(): """ Test hashing with numpy arrays. """ rnd = np.random.RandomState(0) arr1 = rnd.random_sample((10, 10)) arr2 = arr1.copy() arr3 = arr2.copy() arr3[0] += 1 obj_list = (arr1, arr2, arr3) for obj1 in obj_list: for obj2 in obj_list: yield nose.tools.assert_equal, hash(obj1) == hash(obj2), \ np.all(obj1 == obj2) d1 = {1: arr1, 2: arr1} d2 = {1: arr2, 2: arr2} yield nose.tools.assert_equal, hash(d1), hash(d2) d3 = {1: arr2, 2: arr3} yield nose.tools.assert_not_equal, hash(d1), hash(d3) yield nose.tools.assert_not_equal, hash(arr1), hash(arr1.T) @with_numpy def test_numpy_datetime_array(): # memoryview is not supported for some dtypes e.g. 
datetime64 # see https://github.com/joblib/joblib/issues/188 for more details dtypes = ['datetime64[s]', 'timedelta64[D]'] a_hash = hash(np.arange(10)) arrays = (np.arange(0, 10, dtype=dtype) for dtype in dtypes) for array in arrays: nose.tools.assert_not_equal(hash(array), a_hash) @with_numpy def test_hash_numpy_noncontiguous(): a = np.asarray(np.arange(6000).reshape((1000, 2, 3)), order='F')[:, :1, :] b = np.ascontiguousarray(a) nose.tools.assert_not_equal(hash(a), hash(b)) c = np.asfortranarray(a) nose.tools.assert_not_equal(hash(a), hash(c)) @with_numpy def test_hash_memmap(): """ Check that memmap and arrays hash identically if coerce_mmap is True. """ filename = tempfile.mktemp(prefix='joblib_test_hash_memmap_') try: m = np.memmap(filename, shape=(10, 10), mode='w+') a = np.asarray(m) for coerce_mmap in (False, True): yield (nose.tools.assert_equal, hash(a, coerce_mmap=coerce_mmap) == hash(m, coerce_mmap=coerce_mmap), coerce_mmap) finally: if 'm' in locals(): del m # Force a garbage-collection cycle, to be certain that the # object is delete, and we don't run in a problem under # Windows with a file handle still open. gc.collect() try: os.unlink(filename) except OSError as e: # Under windows, some files don't get erased. if not os.name == 'nt': raise e @with_numpy def test_hash_numpy_performance(): """ Check the performance of hashing numpy arrays: In [22]: a = np.random.random(1000000) In [23]: %timeit hashlib.md5(a).hexdigest() 100 loops, best of 3: 20.7 ms per loop In [24]: %timeit hashlib.md5(pickle.dumps(a, protocol=2)).hexdigest() 1 loops, best of 3: 73.1 ms per loop In [25]: %timeit hashlib.md5(cPickle.dumps(a, protocol=2)).hexdigest() 10 loops, best of 3: 53.9 ms per loop In [26]: %timeit hash(a) 100 loops, best of 3: 20.8 ms per loop """ # This test is not stable under windows for some reason, skip it. if sys.platform == 'win32': raise nose.SkipTest() rnd = np.random.RandomState(0) a = rnd.random_sample(1000000) if hasattr(np, 'getbuffer'): # Under python 3, there is no getbuffer getbuffer = np.getbuffer else: getbuffer = memoryview md5_hash = lambda x: hashlib.md5(getbuffer(x)).hexdigest() relative_diff = relative_time(md5_hash, hash, a) assert_less(relative_diff, 0.3) # Check that hashing an tuple of 3 arrays takes approximately # 3 times as much as hashing one array time_hashlib = 3 * time_func(md5_hash, a) time_hash = time_func(hash, (a, a, a)) relative_diff = 0.5 * (abs(time_hash - time_hashlib) / (time_hash + time_hashlib)) assert_less(relative_diff, 0.3) def test_bound_methods_hash(): """ Make sure that calling the same method on two different instances of the same class does resolve to the same hashes. """ a = Klass() b = Klass() nose.tools.assert_equal(hash(filter_args(a.f, [], (1, ))), hash(filter_args(b.f, [], (1, )))) @nose.tools.with_setup(test_memory_setup_func, test_memory_teardown_func) def test_bound_cached_methods_hash(): """ Make sure that calling the same _cached_ method on two different instances of the same class does resolve to the same hashes. 
""" a = KlassWithCachedMethod() b = KlassWithCachedMethod() nose.tools.assert_equal(hash(filter_args(a.f.func, [], (1, ))), hash(filter_args(b.f.func, [], (1, )))) @with_numpy def test_hash_object_dtype(): """ Make sure that ndarrays with dtype `object' hash correctly.""" a = np.array([np.arange(i) for i in range(6)], dtype=object) b = np.array([np.arange(i) for i in range(6)], dtype=object) nose.tools.assert_equal(hash(a), hash(b)) @with_numpy def test_numpy_scalar(): # Numpy scalars are built from compiled functions, and lead to # strange pickling paths explored, that can give hash collisions a = np.float64(2.0) b = np.float64(3.0) nose.tools.assert_not_equal(hash(a), hash(b)) @nose.tools.with_setup(test_memory_setup_func, test_memory_teardown_func) def test_dict_hash(): # Check that dictionaries hash consistently, eventhough the ordering # of the keys is not garanteed k = KlassWithCachedMethod() d = {'#s12069__c_maps.nii.gz': [33], '#s12158__c_maps.nii.gz': [33], '#s12258__c_maps.nii.gz': [33], '#s12277__c_maps.nii.gz': [33], '#s12300__c_maps.nii.gz': [33], '#s12401__c_maps.nii.gz': [33], '#s12430__c_maps.nii.gz': [33], '#s13817__c_maps.nii.gz': [33], '#s13903__c_maps.nii.gz': [33], '#s13916__c_maps.nii.gz': [33], '#s13981__c_maps.nii.gz': [33], '#s13982__c_maps.nii.gz': [33], '#s13983__c_maps.nii.gz': [33]} a = k.f(d) b = k.f(a) nose.tools.assert_equal(hash(a), hash(b)) @nose.tools.with_setup(test_memory_setup_func, test_memory_teardown_func) def test_set_hash(): # Check that sets hash consistently, eventhough their ordering # is not garanteed k = KlassWithCachedMethod() s = set(['#s12069__c_maps.nii.gz', '#s12158__c_maps.nii.gz', '#s12258__c_maps.nii.gz', '#s12277__c_maps.nii.gz', '#s12300__c_maps.nii.gz', '#s12401__c_maps.nii.gz', '#s12430__c_maps.nii.gz', '#s13817__c_maps.nii.gz', '#s13903__c_maps.nii.gz', '#s13916__c_maps.nii.gz', '#s13981__c_maps.nii.gz', '#s13982__c_maps.nii.gz', '#s13983__c_maps.nii.gz']) a = k.f(s) b = k.f(a) nose.tools.assert_equal(hash(a), hash(b)) def test_string(): # Test that we obtain the same hash for object owning several strings, # whatever the past of these strings (which are immutable in Python) string = 'foo' a = {string: 'bar'} b = {string: 'bar'} c = pickle.loads(pickle.dumps(b)) assert_equal(hash([a, b]), hash([a, c])) @with_numpy def test_dtype(): # Test that we obtain the same hash for object owning several dtype, # whatever the past of these dtypes. Catter for cache invalidation with # complex dtype a = np.dtype([('f1', np.uint), ('f2', np.int32)]) b = a c = pickle.loads(pickle.dumps(a)) assert_equal(hash([a, c]), hash([a, b])) def test_hashes_stay_the_same(): # We want to make sure that hashes don't change with joblib # version. For end users, that would mean that they have to # regenerate their cache from scratch, which potentially means # lengthy recomputations. 
rng = random.Random(42) to_hash_list = ['This is a string to hash', u"C'est l\xe9t\xe9", (123456, 54321, -98765), [rng.random() for _ in range(5)], [3, 'abc', None, TransportableException('the message', ValueError)], {'abcde': 123, 'sadfas': [-9999, 2, 3]}] # These expected results have been generated with joblib 0.9.2 expected_dict = { 'py2': ['80436ada343b0d79a99bfd8883a96e45', '2ff3a25200eb6219f468de2640913c2d', '50d81c80af05061ac4dcdc2d5edee6d6', '536af09b66a087ed18b515acc17dc7fc', '123ffc6f13480767167e171a8e1f6f4a', 'fc9314a39ff75b829498380850447047'], 'py3': ['71b3f47df22cb19431d85d92d0b230b2', '2d8d189e9b2b0b2e384d93c868c0e576', 'e205227dd82250871fa25aa0ec690aa3', '9e4e9bf9b91890c9734a6111a35e6633', '6065a3c48e842ea5dee2cfd0d6820ad6', 'aeda150553d4bb5c69f0e69d51b0e2ef']} py_version_str = 'py3' if PY3_OR_LATER else 'py2' expected_list = expected_dict[py_version_str] for to_hash, expected in zip(to_hash_list, expected_list): yield assert_equal, hash(to_hash), expected @with_numpy def test_hashes_stay_the_same_with_numpy_objects(): # We want to make sure that hashes don't change with joblib # version. For end users, that would mean that they have to # regenerate their cache from scratch, which potentially means # lengthy recomputations. rng = np.random.RandomState(42) # Being explicit about dtypes in order to avoid # architecture-related differences. Also using 'f4' rather than # 'f8' for float arrays because 'f8' arrays generated by # rng.random.randn don't seem to be bit-identical on 32bit and # 64bit machines. to_hash_list = [ rng.randint(-1000, high=1000, size=50).astype(' # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. import shutil import os import sys import io from tempfile import mkdtemp import re from joblib.logger import PrintTime try: # Python 2/Python 3 compat unicode('str') except NameError: unicode = lambda s: s ############################################################################### # Test fixtures env = dict() def setup(): """ Test setup. """ cachedir = mkdtemp() if os.path.exists(cachedir): shutil.rmtree(cachedir) env['dir'] = cachedir def teardown(): """ Test teardown. """ #return True shutil.rmtree(env['dir']) ############################################################################### # Tests def test_print_time(): # A simple smoke test for PrintTime. try: orig_stderr = sys.stderr sys.stderr = io.StringIO() print_time = PrintTime(logfile=os.path.join(env['dir'], 'test.log')) print_time(unicode('Foo')) # Create a second time, to smoke test log rotation. print_time = PrintTime(logfile=os.path.join(env['dir'], 'test.log')) print_time(unicode('Foo')) # And a third time print_time = PrintTime(logfile=os.path.join(env['dir'], 'test.log')) print_time(unicode('Foo')) printed_text = sys.stderr.getvalue() # Use regexps to be robust to time variations match = r"Foo: 0\..s, 0\..min\nFoo: 0\..s, 0..min\nFoo: " + \ r".\..s, 0..min\n" if not re.match(match, printed_text): raise AssertionError('Excepted %s, got %s' % (match, printed_text)) finally: sys.stderr = orig_stderr joblib-0.9.4/joblib/test/test_memory.py000066400000000000000000000522761264716474700201640ustar00rootroot00000000000000""" Test the memory module. """ # Author: Gael Varoquaux # Copyright (c) 2009 Gael Varoquaux # License: BSD Style, 3 clauses. 
import shutil import os import os.path from tempfile import mkdtemp import pickle import warnings import io import sys import time import nose from joblib.memory import Memory, MemorizedFunc, NotMemorizedFunc, MemorizedResult from joblib.memory import NotMemorizedResult, _FUNCTION_HASHES from joblib.test.common import with_numpy, np from joblib.testing import assert_raises_regex from joblib._compat import PY3_OR_LATER ############################################################################### # Module-level variables for the tests def f(x, y=1): """ A module-level function for testing purposes. """ return x ** 2 + y ############################################################################### # Test fixtures env = dict() def setup_module(): """ Test setup. """ cachedir = mkdtemp() env['dir'] = cachedir if os.path.exists(cachedir): shutil.rmtree(cachedir) # Don't make the cachedir, Memory should be able to do that on the fly print(80 * '_') print('test_memory setup (%s)' % env['dir']) print(80 * '_') def _rmtree_onerror(func, path, excinfo): print('!' * 79) print('os function failed: %r' % func) print('file to be removed: %s' % path) print('exception was: %r' % excinfo[1]) print('!' * 79) def teardown_module(): """ Test teardown. """ shutil.rmtree(env['dir'], False, _rmtree_onerror) print(80 * '_') print('test_memory teardown (%s)' % env['dir']) print(80 * '_') ############################################################################### # Helper function for the tests def check_identity_lazy(func, accumulator): """ Given a function and an accumulator (a list that grows every time the function is called), check that the function can be decorated by memory to be a lazy identity. """ # Call each function with several arguments, and check that it is # evaluated only once per argument. memory = Memory(cachedir=env['dir'], verbose=0) memory.clear(warn=False) func = memory.cache(func) for i in range(3): for _ in range(2): yield nose.tools.assert_equal, func(i), i yield nose.tools.assert_equal, len(accumulator), i + 1 ############################################################################### # Tests def test_memory_integration(): """ Simple test of memory lazy evaluation. """ accumulator = list() # Rmk: this function has the same name than a module-level function, # thus it serves as a test to see that both are identified # as different. 
def f(l): accumulator.append(1) return l for test in check_identity_lazy(f, accumulator): yield test # Now test clearing for compress in (False, True): for mmap_mode in ('r', None): # We turn verbosity on to smoke test the verbosity code, however, # we capture it, as it is ugly try: # To smoke-test verbosity, we capture stdout orig_stdout = sys.stdout orig_stderr = sys.stdout if PY3_OR_LATER: sys.stderr = io.StringIO() sys.stderr = io.StringIO() else: sys.stdout = io.BytesIO() sys.stderr = io.BytesIO() memory = Memory(cachedir=env['dir'], verbose=10, mmap_mode=mmap_mode, compress=compress) # First clear the cache directory, to check that our code can # handle that # NOTE: this line would raise an exception, as the database file is # still open; we ignore the error since we want to test what # happens if the directory disappears shutil.rmtree(env['dir'], ignore_errors=True) g = memory.cache(f) g(1) g.clear(warn=False) current_accumulator = len(accumulator) out = g(1) finally: sys.stdout = orig_stdout sys.stderr = orig_stderr yield nose.tools.assert_equal, len(accumulator), \ current_accumulator + 1 # Also, check that Memory.eval works similarly yield nose.tools.assert_equal, memory.eval(f, 1), out yield nose.tools.assert_equal, len(accumulator), \ current_accumulator + 1 # Now do a smoke test with a function defined in __main__, as the name # mangling rules are more complex f.__module__ = '__main__' memory = Memory(cachedir=env['dir'], verbose=0) memory.cache(f)(1) def test_no_memory(): """ Test memory with cachedir=None: no memoize """ accumulator = list() def ff(l): accumulator.append(1) return l mem = Memory(cachedir=None, verbose=0) gg = mem.cache(ff) for _ in range(4): current_accumulator = len(accumulator) gg(1) yield nose.tools.assert_equal, len(accumulator), \ current_accumulator + 1 def test_memory_kwarg(): " Test memory with a function with keyword arguments." accumulator = list() def g(l=None, m=1): accumulator.append(1) return l for test in check_identity_lazy(g, accumulator): yield test memory = Memory(cachedir=env['dir'], verbose=0) g = memory.cache(g) # Smoke test with an explicit keyword argument: nose.tools.assert_equal(g(l=30, m=2), 30) def test_memory_lambda(): " Test memory with a function with a lambda." accumulator = list() def helper(x): """ A helper function to define l as a lambda. """ accumulator.append(1) return x l = lambda x: helper(x) for test in check_identity_lazy(l, accumulator): yield test def test_memory_name_collision(): " Check that name collisions with functions will raise warnings" memory = Memory(cachedir=env['dir'], verbose=0) @memory.cache def name_collision(x): """ A first function called name_collision """ return x a = name_collision @memory.cache def name_collision(x): """ A second function called name_collision """ return x b = name_collision if not hasattr(warnings, 'catch_warnings'): # catch_warnings is new in Python 2.6 return with warnings.catch_warnings(record=True) as w: # Cause all warnings to always be triggered. 
warnings.simplefilter("always") # This is a temporary workaround until we get rid of # inspect.getargspec, see # https://github.com/joblib/joblib/issues/247 warnings.simplefilter("ignore", DeprecationWarning) a(1) b(1) yield nose.tools.assert_equal, len(w), 1 yield nose.tools.assert_true, "collision" in str(w[-1].message) def test_memory_warning_lambda_collisions(): # Check that multiple use of lambda will raise collisions memory = Memory(cachedir=env['dir'], verbose=0) # For isolation with other tests memory.clear() a = lambda x: x a = memory.cache(a) b = lambda x: x + 1 b = memory.cache(b) with warnings.catch_warnings(record=True) as w: # Cause all warnings to always be triggered. warnings.simplefilter("always") # This is a temporary workaround until we get rid of # inspect.getargspec, see # https://github.com/joblib/joblib/issues/247 warnings.simplefilter("ignore", DeprecationWarning) nose.tools.assert_equal(0, a(0)) nose.tools.assert_equal(2, b(1)) nose.tools.assert_equal(1, a(1)) # In recent Python versions, we can retrieve the code of lambdas, # thus nothing is raised nose.tools.assert_equal(len(w), 4) def test_memory_warning_collision_detection(): # Check that collisions impossible to detect will raise appropriate # warnings. memory = Memory(cachedir=env['dir'], verbose=0) # For isolation with other tests memory.clear() a1 = eval('lambda x: x') a1 = memory.cache(a1) b1 = eval('lambda x: x+1') b1 = memory.cache(b1) if not hasattr(warnings, 'catch_warnings'): # catch_warnings is new in Python 2.6 return with warnings.catch_warnings(record=True) as w: # Cause all warnings to always be triggered. warnings.simplefilter("always") # This is a temporary workaround until we get rid of # inspect.getargspec, see # https://github.com/joblib/joblib/issues/247 warnings.simplefilter("ignore", DeprecationWarning) a1(1) b1(1) a1(0) yield nose.tools.assert_equal, len(w), 2 yield nose.tools.assert_true, \ "cannot detect" in str(w[-1].message).lower() def test_memory_partial(): " Test memory with functools.partial." accumulator = list() def func(x, y): """ A helper function to define l as a lambda. """ accumulator.append(1) return y import functools function = functools.partial(func, 1) for test in check_identity_lazy(function, accumulator): yield test def test_memory_eval(): " Smoke test memory with a function with a function defined in an eval." memory = Memory(cachedir=env['dir'], verbose=0) m = eval('lambda x: x') mm = memory.cache(m) yield nose.tools.assert_equal, 1, mm(1) def count_and_append(x=[]): """ A function with a side effect in its arguments. Return the lenght of its argument and append one element. """ len_x = len(x) x.append(None) return len_x def test_argument_change(): """ Check that if a function has a side effect in its arguments, it should use the hash of changing arguments. """ mem = Memory(cachedir=env['dir'], verbose=0) func = mem.cache(count_and_append) # call the function for the first time, is should cache it with # argument x=[] assert func() == 0 # the second time the argument is x=[None], which is not cached # yet, so the functions should be called a second time assert func() == 1 @with_numpy def test_memory_numpy(): " Test memory with a function with numpy arrays." # Check with memmapping and without. 
for mmap_mode in (None, 'r'): accumulator = list() def n(l=None): accumulator.append(1) return l memory = Memory(cachedir=env['dir'], mmap_mode=mmap_mode, verbose=0) memory.clear(warn=False) cached_n = memory.cache(n) rnd = np.random.RandomState(0) for i in range(3): a = rnd.random_sample((10, 10)) for _ in range(3): yield nose.tools.assert_true, np.all(cached_n(a) == a) yield nose.tools.assert_equal, len(accumulator), i + 1 @with_numpy def test_memory_numpy_check_mmap_mode(): """Check that mmap_mode is respected even at the first call""" memory = Memory(cachedir=env['dir'], mmap_mode='r', verbose=0) memory.clear(warn=False) @memory.cache() def twice(a): return a * 2 a = np.ones(3) b = twice(a) c = twice(a) nose.tools.assert_true(isinstance(c, np.memmap)) nose.tools.assert_equal(c.mode, 'r') nose.tools.assert_true(isinstance(b, np.memmap)) nose.tools.assert_equal(b.mode, 'r') def test_memory_exception(): """ Smoketest the exception handling of Memory. """ memory = Memory(cachedir=env['dir'], verbose=0) class MyException(Exception): pass @memory.cache def h(exc=0): if exc: raise MyException # Call once, to initialise the cache h() for _ in range(3): # Call 3 times, to be sure that the Exception is always raised yield nose.tools.assert_raises, MyException, h, 1 def test_memory_ignore(): " Test the ignore feature of memory " memory = Memory(cachedir=env['dir'], verbose=0) accumulator = list() @memory.cache(ignore=['y']) def z(x, y=1): accumulator.append(1) yield nose.tools.assert_equal, z.ignore, ['y'] z(0, y=1) yield nose.tools.assert_equal, len(accumulator), 1 z(0, y=1) yield nose.tools.assert_equal, len(accumulator), 1 z(0, y=2) yield nose.tools.assert_equal, len(accumulator), 1 def test_partial_decoration(): "Check cache may be called with kwargs before decorating" memory = Memory(cachedir=env['dir'], verbose=0) test_values = [ (['x'], 100, 'r'), ([], 10, None), ] for ignore, verbose, mmap_mode in test_values: @memory.cache(ignore=ignore, verbose=verbose, mmap_mode=mmap_mode) def z(x): pass yield nose.tools.assert_equal, z.ignore, ignore yield nose.tools.assert_equal, z._verbose, verbose yield nose.tools.assert_equal, z.mmap_mode, mmap_mode def test_func_dir(): # Test the creation of the memory cache directory for the function. memory = Memory(cachedir=env['dir'], verbose=0) memory.clear() path = __name__.split('.') path.append('f') path = os.path.join(env['dir'], 'joblib', *path) g = memory.cache(f) # Test that the function directory is created on demand yield nose.tools.assert_equal, g._get_func_dir(), path yield nose.tools.assert_true, os.path.exists(path) # Test that the code is stored. # For the following test to be robust to previous execution, we clear # the in-memory store _FUNCTION_HASHES.clear() yield nose.tools.assert_false, \ g._check_previous_func_code() yield nose.tools.assert_true, \ os.path.exists(os.path.join(path, 'func_code.py')) yield nose.tools.assert_true, \ g._check_previous_func_code() # Test the robustness to failure of loading previous results. dir, _ = g.get_output_dir(1) a = g(1) yield nose.tools.assert_true, os.path.exists(dir) os.remove(os.path.join(dir, 'output.pkl')) yield nose.tools.assert_equal, a, g(1) def test_persistence(): # Test the memorized functions can be pickled and restored. 
memory = Memory(cachedir=env['dir'], verbose=0) g = memory.cache(f) output = g(1) h = pickle.loads(pickle.dumps(g)) output_dir, _ = g.get_output_dir(1) yield nose.tools.assert_equal, output, h.load_output(output_dir) memory2 = pickle.loads(pickle.dumps(memory)) yield nose.tools.assert_equal, memory.cachedir, memory2.cachedir # Smoke test that pickling a memory with cachedir=None works memory = Memory(cachedir=None, verbose=0) pickle.loads(pickle.dumps(memory)) g = memory.cache(f) gp = pickle.loads(pickle.dumps(g)) gp(1) def test_call_and_shelve(): """Test MemorizedFunc outputting a reference to cache. """ for func, Result in zip((MemorizedFunc(f, env['dir']), NotMemorizedFunc(f), Memory(cachedir=env['dir']).cache(f), Memory(cachedir=None).cache(f), ), (MemorizedResult, NotMemorizedResult, MemorizedResult, NotMemorizedResult)): nose.tools.assert_equal(func(2), 5) result = func.call_and_shelve(2) nose.tools.assert_true(isinstance(result, Result)) nose.tools.assert_equal(result.get(), 5) result.clear() nose.tools.assert_raises(KeyError, result.get) result.clear() # Do nothing if there is no cache. def test_memorized_pickling(): for func in (MemorizedFunc(f, env['dir']), NotMemorizedFunc(f)): filename = os.path.join(env['dir'], 'pickling_test.dat') result = func.call_and_shelve(2) with open(filename, 'wb') as fp: pickle.dump(result, fp) with open(filename, 'rb') as fp: result2 = pickle.load(fp) nose.tools.assert_equal(result2.get(), result.get()) os.remove(filename) def test_memorized_repr(): func = MemorizedFunc(f, env['dir']) result = func.call_and_shelve(2) func2 = MemorizedFunc(f, env['dir']) result2 = func2.call_and_shelve(2) nose.tools.assert_equal(result.get(), result2.get()) nose.tools.assert_equal(repr(func), repr(func2)) # Smoke test on deprecated methods func.format_signature(2) func.format_call(2) # Smoke test with NotMemorizedFunc func = NotMemorizedFunc(f) repr(func) repr(func.call_and_shelve(2)) # Smoke test for message output (increase code coverage) func = MemorizedFunc(f, env['dir'], verbose=11, timestamp=time.time()) result = func.call_and_shelve(11) result.get() func = MemorizedFunc(f, env['dir'], verbose=11) result = func.call_and_shelve(11) result.get() func = MemorizedFunc(f, env['dir'], verbose=5, timestamp=time.time()) result = func.call_and_shelve(11) result.get() func = MemorizedFunc(f, env['dir'], verbose=5) result = func.call_and_shelve(11) result.get() def test_memory_file_modification(): # Test that modifying a Python file after loading it does not lead to # Recomputation dir_name = os.path.join(env['dir'], 'tmp_import') if not os.path.exists(dir_name): os.mkdir(dir_name) filename = os.path.join(dir_name, 'tmp_joblib_.py') content = 'def f(x):\n print(x)\n return x\n' with open(filename, 'w') as module_file: module_file.write(content) # Load the module: sys.path.append(dir_name) import tmp_joblib_ as tmp mem = Memory(cachedir=env['dir'], verbose=0) f = mem.cache(tmp.f) # Capture sys.stdout to count how many time f is called orig_stdout = sys.stdout if PY3_OR_LATER: my_stdout = io.StringIO() else: my_stdout = io.BytesIO() try: sys.stdout = my_stdout # First call f a few times f(1) f(2) f(1) # Now modify the module where f is stored without modifying f with open(filename, 'w') as module_file: module_file.write('\n\n' + content) # And call f a couple more times f(1) f(1) # Flush the .pyc files shutil.rmtree(dir_name) os.mkdir(dir_name) # Now modify the module where f is stored, modifying f content = 'def f(x):\n print("x=%s" % x)\n return x\n' with open(filename, 
'w') as module_file: module_file.write(content) # And call f more times prior to reloading: the cache should not be # invalidated at this point as the active function definition has not # changed in memory yet. f(1) f(1) # Now reload my_stdout.write('Reloading\n') sys.modules.pop('tmp_joblib_') import tmp_joblib_ as tmp f = mem.cache(tmp.f) # And call f more times f(1) f(1) finally: sys.stdout = orig_stdout nose.tools.assert_equal(my_stdout.getvalue(), '1\n2\nReloading\nx=1\n') def _function_to_cache(a, b): # Just a place holder function to be mutated by tests pass def _sum(a, b): return a + b def _product(a, b): return a * b def test_memory_in_memory_function_code_change(): _function_to_cache.__code__ = _sum.__code__ mem = Memory(cachedir=env['dir'], verbose=0) f = mem.cache(_function_to_cache) nose.tools.assert_equal(f(1, 2), 3) nose.tools.assert_equal(f(1, 2), 3) with warnings.catch_warnings(record=True): # ignore name collision warnings warnings.simplefilter("always") # Check that inline function modification triggers a cache invalidation _function_to_cache.__code__ = _product.__code__ nose.tools.assert_equal(f(1, 2), 2) nose.tools.assert_equal(f(1, 2), 2) def test_clear_memory_with_none_cachedir(): mem = Memory(cachedir=None) mem.clear() if PY3_OR_LATER: exec(""" def func_with_kwonly_args(a, b, *, kw1='kw1', kw2='kw2'): return a, b, kw1, kw2 def func_with_signature(a: int, b: float) -> float: return a + b """) def test_memory_func_with_kwonly_args(): mem = Memory(cachedir=env['dir'], verbose=0) func_cached = mem.cache(func_with_kwonly_args) nose.tools.assert_equal(func_cached(1, 2, kw1=3), (1, 2, 3, 'kw2')) # Making sure that providing a keyword-only argument by # position raises an exception assert_raises_regex( ValueError, "Keyword-only parameter 'kw1' was passed as positional parameter", func_cached, 1, 2, 3, {'kw2': 4}) # Keyword-only parameter passed by position with cached call # should still raise ValueError func_cached(1, 2, kw1=3, kw2=4) assert_raises_regex( ValueError, "Keyword-only parameter 'kw1' was passed as positional parameter", func_cached, 1, 2, 3, {'kw2': 4}) # Test 'ignore' parameter func_cached = mem.cache(func_with_kwonly_args, ignore=['kw2']) nose.tools.assert_equal(func_cached(1, 2, kw1=3, kw2=4), (1, 2, 3, 4)) nose.tools.assert_equal(func_cached(1, 2, kw1=3, kw2='ignored'), (1, 2, 3, 4)) def test_memory_func_with_signature(): mem = Memory(cachedir=env['dir'], verbose=0) func_cached = mem.cache(func_with_signature) nose.tools.assert_equal(func_cached(1, 2.), 3.) 
joblib-0.9.4/joblib/test/test_my_exceptions.py000066400000000000000000000047171264716474700215370ustar00rootroot00000000000000""" Test my automatically generate exceptions """ from nose.tools import assert_true from joblib import my_exceptions class CustomException(Exception): def __init__(self, a, b, c, d): self.a, self.b, self.c, self.d = a, b, c, d class CustomException2(Exception): """A custom exception with a .args attribute Just to check that the JoblibException created from it has it args set correctly """ def __init__(self, a, *args): self.a = a self.args = args def test_inheritance(): assert_true(isinstance(my_exceptions.JoblibNameError(), NameError)) assert_true(isinstance(my_exceptions.JoblibNameError(), my_exceptions.JoblibException)) assert_true(my_exceptions.JoblibNameError is my_exceptions._mk_exception(NameError)[0]) def test_inheritance_special_cases(): # _mk_exception should transform Exception to JoblibException assert_true(my_exceptions._mk_exception(Exception)[0] is my_exceptions.JoblibException) # Subclasses of JoblibException should be mapped to # JoblibException by _mk_exception for exception in [my_exceptions.JoblibException, my_exceptions.TransportableException]: assert_true(my_exceptions._mk_exception(exception)[0] is my_exceptions.JoblibException) # Non-inheritable exception classes should be mapped to # JoblibException by _mk_exception. That can happen with classes # generated with SWIG. See # https://github.com/joblib/joblib/issues/269 for a concrete # example. non_inheritable_classes = [type(lambda: None), bool] for exception in non_inheritable_classes: assert_true(my_exceptions._mk_exception(exception)[0] is my_exceptions.JoblibException) def test__mk_exception(): # Check that _mk_exception works on a bunch of different exceptions for klass in (Exception, TypeError, SyntaxError, ValueError, AssertionError, CustomException, CustomException2): message = 'This message should be in the exception repr' exc = my_exceptions._mk_exception(klass)[0]( message, 'some', 'other', 'args', 'that are not', 'in the repr') exc_repr = repr(exc) assert_true(isinstance(exc, klass)) assert_true(isinstance(exc, my_exceptions.JoblibException)) assert_true(exc.__class__.__name__ in exc_repr) assert_true(message in exc_repr) joblib-0.9.4/joblib/test/test_numpy_pickle.py000066400000000000000000000324531264716474700213460ustar00rootroot00000000000000""" Test the numpy pickler as a replacement of the standard pickler. """ from tempfile import mkdtemp import copy import shutil import os import random import sys import re import tempfile import glob import nose from joblib.test.common import np, with_numpy # numpy_pickle is not a drop-in replacement of pickle, as it takes # filenames instead of open files as arguments. from joblib import numpy_pickle from joblib.test import data ############################################################################### # Define a list of standard types. 
# Borrowed from dill, initial author: Micheal McKerns: # http://dev.danse.us/trac/pathos/browser/dill/dill_test2.py typelist = [] # testing types _none = None typelist.append(_none) _type = type typelist.append(_type) _bool = bool(1) typelist.append(_bool) _int = int(1) typelist.append(_int) try: _long = long(1) typelist.append(_long) except NameError: # long is not defined in python 3 pass _float = float(1) typelist.append(_float) _complex = complex(1) typelist.append(_complex) _string = str(1) typelist.append(_string) try: _unicode = unicode(1) typelist.append(_unicode) except NameError: # unicode is not defined in python 3 pass _tuple = () typelist.append(_tuple) _list = [] typelist.append(_list) _dict = {} typelist.append(_dict) try: _file = file typelist.append(_file) except NameError: pass # file does not exists in Python 3 try: _buffer = buffer typelist.append(_buffer) except NameError: # buffer does not exists in Python 3 pass _builtin = len typelist.append(_builtin) def _function(x): yield x class _class: def _method(self): pass class _newclass(object): def _method(self): pass typelist.append(_function) typelist.append(_class) typelist.append(_newclass) # _instance = _class() typelist.append(_instance) _object = _newclass() typelist.append(_object) # ############################################################################### # Test fixtures env = dict() def setup_module(): """ Test setup. """ env['dir'] = mkdtemp() env['filename'] = os.path.join(env['dir'], 'test.pkl') print(80 * '_') print('setup numpy_pickle') print(80 * '_') def teardown_module(): """ Test teardown. """ shutil.rmtree(env['dir']) #del env['dir'] #del env['filename'] print(80 * '_') print('teardown numpy_pickle') print(80 * '_') ############################################################################### # Tests def test_standard_types(): # Test pickling and saving with standard types. filename = env['filename'] for compress in [0, 1]: for member in typelist: # Change the file name to avoid side effects between tests this_filename = filename + str(random.randint(0, 1000)) numpy_pickle.dump(member, this_filename, compress=compress) _member = numpy_pickle.load(this_filename) # We compare the pickled instance to the reloaded one only if it # can be compared to a copied one if member == copy.deepcopy(member): yield nose.tools.assert_equal, member, _member def test_value_error(): # Test inverting the input arguments to dump nose.tools.assert_raises(ValueError, numpy_pickle.dump, 'foo', dict()) @with_numpy def test_numpy_persistence(): filename = env['filename'] rnd = np.random.RandomState(0) a = rnd.random_sample((10, 2)) for compress, cache_size in ((0, 0), (1, 0), (1, 10)): # We use 'a.T' to have a non C-contiguous array. for index, obj in enumerate(((a,), (a.T,), (a, a), [a, a, a])): # Change the file name to avoid side effects between tests this_filename = filename + str(random.randint(0, 1000)) filenames = numpy_pickle.dump(obj, this_filename, compress=compress, cache_size=cache_size) # Check that one file was created per array if not compress: nose.tools.assert_equal(len(filenames), len(obj) + 1) # Check that these files do exist for file in filenames: nose.tools.assert_true( os.path.exists(os.path.join(env['dir'], file))) # Unpickle the object obj_ = numpy_pickle.load(this_filename) # Check that the items are indeed arrays for item in obj_: nose.tools.assert_true(isinstance(item, np.ndarray)) # And finally, check that all the values are equal. 
nose.tools.assert_true(np.all(np.array(obj) == np.array(obj_))) # Now test with array subclasses for obj in ( np.matrix(np.zeros(10)), np.core.multiarray._reconstruct(np.memmap, (), np.float) ): this_filename = filename + str(random.randint(0, 1000)) filenames = numpy_pickle.dump(obj, this_filename, compress=compress, cache_size=cache_size) obj_ = numpy_pickle.load(this_filename) if (type(obj) is not np.memmap and hasattr(obj, '__array_prepare__')): # We don't reconstruct memmaps nose.tools.assert_true(isinstance(obj_, type(obj))) # Finally smoke test the warning in case of compress + mmap_mode this_filename = filename + str(random.randint(0, 1000)) numpy_pickle.dump(a, this_filename, compress=1) numpy_pickle.load(this_filename, mmap_mode='r') @with_numpy def test_memmap_persistence(): rnd = np.random.RandomState(0) a = rnd.random_sample(10) filename = env['filename'] + str(random.randint(0, 1000)) numpy_pickle.dump(a, filename) b = numpy_pickle.load(filename, mmap_mode='r') nose.tools.assert_true(isinstance(b, np.memmap)) @with_numpy def test_memmap_persistence_mixed_dtypes(): # loading datastructures that have sub-arrays with dtype=object # should not prevent memmaping on fixed size dtype sub-arrays. rnd = np.random.RandomState(0) a = rnd.random_sample(10) b = np.array([1, 'b'], dtype=object) construct = (a, b) filename = env['filename'] + str(random.randint(0, 1000)) numpy_pickle.dump(construct, filename) a_clone, b_clone = numpy_pickle.load(filename, mmap_mode='r') # the floating point array has been memory mapped nose.tools.assert_true(isinstance(a_clone, np.memmap)) # the object-dtype array has been loaded in memory nose.tools.assert_false(isinstance(b_clone, np.memmap)) @with_numpy def test_masked_array_persistence(): # The special-case picker fails, because saving masked_array # not implemented, but it just delegates to the standard pickler. 
rnd = np.random.RandomState(0) a = rnd.random_sample(10) a = np.ma.masked_greater(a, 0.5) filename = env['filename'] + str(random.randint(0, 1000)) numpy_pickle.dump(a, filename) b = numpy_pickle.load(filename, mmap_mode='r') nose.tools.assert_true(isinstance(b, np.ma.masked_array)) def test_z_file(): # Test saving and loading data with Zfiles filename = env['filename'] + str(random.randint(0, 1000)) data = numpy_pickle.asbytes('Foo, \n Bar, baz, \n\nfoobar') with open(filename, 'wb') as f: numpy_pickle.write_zfile(f, data) with open(filename, 'rb') as f: data_read = numpy_pickle.read_zfile(f) nose.tools.assert_equal(data, data_read) @with_numpy def test_compressed_pickle_dump_and_load(): # XXX: temporarily disable this test on non little-endian machines if sys.byteorder != 'little': raise nose.SkipTest('Skipping this test on non little-endian machines') expected_list = [np.arange(5, dtype=np.dtype('= pickle_writing_protocol: try: result_list = numpy_pickle.load(filename) for result, expected in zip(result_list, expected_list): if isinstance(expected, np.ndarray): nose.tools.assert_equal(result.dtype, expected.dtype) np.testing.assert_equal(result, expected) else: nose.tools.assert_equal(result, expected) except Exception as exc: # When trying to read with python 3 a pickle generated # with python 2 we expect a user-friendly error if (py_version_used_for_reading == 3 and py_version_used_for_writing == 2): nose.tools.assert_true(isinstance(exc, ValueError)) message = ('You may be trying to read with ' 'python 3 a joblib pickle generated with python 2.') nose.tools.assert_true(message in str(exc)) else: raise else: # Pickle protocol used for writing is too high. We expect a # "unsupported pickle protocol" error message try: numpy_pickle.load(filename) raise AssertionError('Numpy pickle loading should ' 'have raised a ValueError exception') except ValueError as e: message = 'unsupported pickle protocol: {0}'.format( pickle_writing_protocol) nose.tools.assert_true(message in str(e.args)) @with_numpy def test_joblib_pickle_across_python_versions(): # XXX: temporarily disable this test on non little-endian machines if sys.byteorder != 'little': raise nose.SkipTest('Skipping this test on non little-endian machines') # We need to be specific about dtypes in particular endianness # because the pickles can be generated on one architecture and # the tests run on another one. See # https://github.com/joblib/joblib/issues/279. expected_list = [np.arange(5, dtype=np.dtype(' # Copyright (c) 2010-2011 Gael Varoquaux # License: BSD Style, 3 clauses. 
import time import sys import io import os from joblib.test.common import np, with_numpy from joblib.test.common import with_multiprocessing from joblib.testing import check_subprocess_call from joblib._compat import PY3_OR_LATER try: import cPickle as pickle PickleError = TypeError except: import pickle PickleError = pickle.PicklingError if PY3_OR_LATER: PickleError = pickle.PicklingError try: # Python 2/Python 3 compat unicode('str') except NameError: unicode = lambda s: s try: from queue import Queue except ImportError: # Backward compat from Queue import Queue from joblib.parallel import Parallel, delayed, SafeFunction, WorkerInterrupt from joblib.parallel import mp, cpu_count, VALID_BACKENDS from joblib.my_exceptions import JoblibException import nose from nose.tools import assert_equal, assert_true, assert_false, assert_raises ALL_VALID_BACKENDS = [None] + VALID_BACKENDS if hasattr(mp, 'get_context'): # Custom multiprocessing context in Python 3.4+ ALL_VALID_BACKENDS.append(mp.get_context('spawn')) def division(x, y): return x / y def square(x): return x ** 2 class MyExceptionWithFinickyInit(Exception): """An exception class with non trivial __init__ """ def __init__(self, a, b, c, d): pass def exception_raiser(x, custom_exception=False): if x == 7: raise (MyExceptionWithFinickyInit('a', 'b', 'c', 'd') if custom_exception else ValueError) return x def interrupt_raiser(x): time.sleep(.05) raise KeyboardInterrupt def f(x, y=0, z=0): """ A module-level function so that it can be spawn with multiprocessing. """ return x ** 2 + y + z ############################################################################### def test_cpu_count(): assert cpu_count() > 0 ############################################################################### # Test parallel def check_simple_parallel(backend): X = range(5) for n_jobs in (1, 2, -1, -2): assert_equal( [square(x) for x in X], Parallel(n_jobs=n_jobs)(delayed(square)(x) for x in X)) try: # To smoke-test verbosity, we capture stdout orig_stdout = sys.stdout orig_stderr = sys.stdout if PY3_OR_LATER: sys.stderr = io.StringIO() sys.stderr = io.StringIO() else: sys.stdout = io.BytesIO() sys.stderr = io.BytesIO() for verbose in (2, 11, 100): Parallel(n_jobs=-1, verbose=verbose, backend=backend)( delayed(square)(x) for x in X) Parallel(n_jobs=1, verbose=verbose, backend=backend)( delayed(square)(x) for x in X) Parallel(n_jobs=2, verbose=verbose, pre_dispatch=2, backend=backend)( delayed(square)(x) for x in X) Parallel(n_jobs=2, verbose=verbose, backend=backend)( delayed(square)(x) for x in X) except Exception as e: my_stdout = sys.stdout my_stderr = sys.stderr sys.stdout = orig_stdout sys.stderr = orig_stderr print(unicode(my_stdout.getvalue())) print(unicode(my_stderr.getvalue())) raise e finally: sys.stdout = orig_stdout sys.stderr = orig_stderr def test_simple_parallel(): for backend in ALL_VALID_BACKENDS: yield check_simple_parallel, backend def nested_loop(backend): Parallel(n_jobs=2, backend=backend)( delayed(square)(.01) for _ in range(2)) def check_nested_loop(parent_backend, child_backend): Parallel(n_jobs=2, backend=parent_backend)( delayed(nested_loop)(child_backend) for _ in range(2)) def test_nested_loop(): for parent_backend in VALID_BACKENDS: for child_backend in VALID_BACKENDS: yield check_nested_loop, parent_backend, child_backend def test_mutate_input_with_threads(): """Input is mutable when using the threading backend""" q = Queue(maxsize=5) Parallel(n_jobs=2, backend="threading")( delayed(q.put, check_pickle=False)(1) for _ in range(5)) 
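    # (Added note, hedged) the threading backend runs the workers in the
    # parent process, so the five q.put(1) calls above mutate this very
    # Queue (maxsize=5) in place; the assertion below relies on that.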
assert_true(q.full()) def test_parallel_kwargs(): """Check the keyword argument processing of pmap.""" lst = range(10) for n_jobs in (1, 4): yield (assert_equal, [f(x, y=1) for x in lst], Parallel(n_jobs=n_jobs)(delayed(f)(x, y=1) for x in lst)) def check_parallel_context_manager(backend): lst = range(10) expected = [f(x, y=1) for x in lst] with Parallel(n_jobs=4, backend=backend) as p: # Internally a pool instance has been eagerly created and is managed # via the context manager protocol managed_pool = p._pool if mp is not None: assert_true(managed_pool is not None) # We make call with the managed parallel object several times inside # the managed block: assert_equal(expected, p(delayed(f)(x, y=1) for x in lst)) assert_equal(expected, p(delayed(f)(x, y=1) for x in lst)) # Those calls have all used the same pool instance: if mp is not None: assert_true(managed_pool is p._pool) # As soon as we exit the context manager block, the pool is terminated and # no longer referenced from the parallel object: assert_true(p._pool is None) # It's still possible to use the parallel instance in non-managed mode: assert_equal(expected, p(delayed(f)(x, y=1) for x in lst)) assert_true(p._pool is None) def test_parallel_context_manager(): for backend in ['multiprocessing', 'threading']: yield check_parallel_context_manager, backend def test_parallel_pickling(): """ Check that pmap captures the errors when it is passed an object that cannot be pickled. """ def g(x): return x ** 2 try: # pickling a local function always fail but the exception # raised is a PickleError for python <= 3.4 and AttributeError # for python >= 3.5 pickle.dumps(g) except Exception as exc: exception_class = exc.__class__ assert_raises(exception_class, Parallel(), (delayed(g)(x) for x in range(10))) def test_error_capture(): # Check that error are captured, and that correct exceptions # are raised. 
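    # (Added note, hedged) with the multiprocessing backend, exceptions
    # raised in workers are re-raised in the parent wrapped so that they
    # stay catchable both as JoblibException and, as checked further down,
    # as the original exception class (e.g. ZeroDivisionError).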
if mp is not None: # A JoblibException will be raised only if there is indeed # multiprocessing assert_raises(JoblibException, Parallel(n_jobs=2), [delayed(division)(x, y) for x, y in zip((0, 1), (1, 0))]) assert_raises(WorkerInterrupt, Parallel(n_jobs=2), [delayed(interrupt_raiser)(x) for x in (1, 0)]) # Try again with the context manager API with Parallel(n_jobs=2) as parallel: assert_true(parallel._pool is not None) assert_raises(JoblibException, parallel, [delayed(division)(x, y) for x, y in zip((0, 1), (1, 0))]) # The managed pool should still be available and be in a working # state despite the previously raised (and caught) exception assert_true(parallel._pool is not None) assert_equal([f(x, y=1) for x in range(10)], parallel(delayed(f)(x, y=1) for x in range(10))) assert_raises(WorkerInterrupt, parallel, [delayed(interrupt_raiser)(x) for x in (1, 0)]) # The pool should still be available despite the exception assert_true(parallel._pool is not None) assert_equal([f(x, y=1) for x in range(10)], parallel(delayed(f)(x, y=1) for x in range(10))) # Check that the inner pool has been terminated when exiting the # context manager assert_true(parallel._pool is None) else: assert_raises(KeyboardInterrupt, Parallel(n_jobs=2), [delayed(interrupt_raiser)(x) for x in (1, 0)]) # wrapped exceptions should inherit from the class of the original # exception to make it easy to catch them assert_raises(ZeroDivisionError, Parallel(n_jobs=2), [delayed(division)(x, y) for x, y in zip((0, 1), (1, 0))]) assert_raises( MyExceptionWithFinickyInit, Parallel(n_jobs=2, verbose=0), (delayed(exception_raiser)(i, custom_exception=True) for i in range(30))) try: # JoblibException wrapping is disabled in sequential mode: ex = JoblibException() Parallel(n_jobs=1)( delayed(division)(x, y) for x, y in zip((0, 1), (1, 0))) except Exception as ex: assert_false(isinstance(ex, JoblibException)) class Counter(object): def __init__(self, list1, list2): self.list1 = list1 self.list2 = list2 def __call__(self, i): self.list1.append(i) assert_equal(len(self.list1), len(self.list2)) def consumer(queue, item): queue.append('Consumed %s' % item) def check_dispatch_one_job(backend): """ Test that with only one job, Parallel does act as a iterator. """ queue = list() def producer(): for i in range(6): queue.append('Produced %i' % i) yield i # disable batching Parallel(n_jobs=1, batch_size=1, backend=backend)( delayed(consumer)(queue, x) for x in producer()) assert_equal(queue, [ 'Produced 0', 'Consumed 0', 'Produced 1', 'Consumed 1', 'Produced 2', 'Consumed 2', 'Produced 3', 'Consumed 3', 'Produced 4', 'Consumed 4', 'Produced 5', 'Consumed 5', ]) assert_equal(len(queue), 12) # empty the queue for the next check queue[:] = [] # enable batching Parallel(n_jobs=1, batch_size=4, backend=backend)( delayed(consumer)(queue, x) for x in producer()) assert_equal(queue, [ # First batch 'Produced 0', 'Produced 1', 'Produced 2', 'Produced 3', 'Consumed 0', 'Consumed 1', 'Consumed 2', 'Consumed 3', # Second batch 'Produced 4', 'Produced 5', 'Consumed 4', 'Consumed 5', ]) assert_equal(len(queue), 12) def test_dispatch_one_job(): for backend in VALID_BACKENDS: yield check_dispatch_one_job, backend def check_dispatch_multiprocessing(backend): """ Check that using pre_dispatch Parallel does indeed dispatch items lazily. 
""" if mp is None: raise nose.SkipTest() manager = mp.Manager() queue = manager.list() def producer(): for i in range(6): queue.append('Produced %i' % i) yield i Parallel(n_jobs=2, batch_size=1, pre_dispatch=3, backend=backend)( delayed(consumer)(queue, 'any') for _ in producer()) # Only 3 tasks are dispatched out of 6. The 4th task is dispatched only # after any of the first 3 jobs have completed. first_four = list(queue)[:4] # The the first consumption event can sometimes happen before the end of # the dispatching, hence, pop it before introspecting the "Produced" events first_four.remove('Consumed any') assert_equal(first_four, ['Produced 0', 'Produced 1', 'Produced 2']) assert_equal(len(queue), 12) def test_dispatch_multiprocessing(): for backend in VALID_BACKENDS: yield check_dispatch_multiprocessing, backend def test_batching_auto_threading(): # batching='auto' with the threading backend leaves the effective batch # size to 1 (no batching) as it has been found to never be beneficial with # this low-overhead backend. p = Parallel(n_jobs=2, batch_size='auto', backend='threading') p(delayed(id)(i) for i in range(5000)) # many very fast tasks assert_equal(p._effective_batch_size, 1) def test_batching_auto_multiprocessing(): p = Parallel(n_jobs=2, batch_size='auto', backend='multiprocessing') p(delayed(id)(i) for i in range(5000)) # many very fast tasks # When the auto-tuning of the batch size is enabled # size kicks in the following attribute gets updated. assert_true(hasattr(p, '_effective_batch_size')) # It should be strictly larger than 1 but as we don't want heisen failures # on clogged CI worker environment be safe and only check that it's a # strictly positive number. assert_true(p._effective_batch_size > 0) def test_exception_dispatch(): "Make sure that exception raised during dispatch are indeed captured" assert_raises( ValueError, Parallel(n_jobs=2, pre_dispatch=16, verbose=0), (delayed(exception_raiser)(i) for i in range(30)), ) def test_nested_exception_dispatch(): # Ensure TransportableException objects for nested joblib cases gets # propagated. assert_raises( JoblibException, Parallel(n_jobs=2, pre_dispatch=16, verbose=0), (delayed(SafeFunction(exception_raiser))(i) for i in range(30))) def _reload_joblib(): # Retrieve the path of the parallel module in a robust way joblib_path = Parallel.__module__.split(os.sep) joblib_path = joblib_path[:1] joblib_path.append('parallel.py') joblib_path = '/'.join(joblib_path) module = __import__(joblib_path) # Reload the module. 
This should trigger a fail reload(module) def test_multiple_spawning(): # Test that attempting to launch a new Python after spawned # subprocesses will raise an error, to avoid infinite loops on # systems that do not support fork if not int(os.environ.get('JOBLIB_MULTIPROCESSING', 1)): raise nose.SkipTest() assert_raises(ImportError, Parallel(n_jobs=2, pre_dispatch='all'), [delayed(_reload_joblib)() for i in range(10)]) ############################################################################### # Test helpers def test_joblib_exception(): # Smoke-test the custom exception e = JoblibException('foobar') # Test the repr repr(e) # Test the pickle pickle.dumps(e) def test_safe_function(): safe_division = SafeFunction(division) assert_raises(JoblibException, safe_division, 1, 0) def test_invalid_batch_size(): assert_raises(ValueError, Parallel, batch_size=0) assert_raises(ValueError, Parallel, batch_size=-1) assert_raises(ValueError, Parallel, batch_size=1.42) def check_same_results(params): n_tasks = params.pop('n_tasks') expected = [square(i) for i in range(n_tasks)] results = Parallel(**params)(delayed(square)(i) for i in range(n_tasks)) assert_equal(results, expected) def test_dispatch_race_condition(): # Check that using (async-)dispatch does not yield a race condition on the # iterable generator that is not thread-safe natively. # This is a non-regression test for the "Pool seems closed" class of error yield check_same_results, dict(n_tasks=2, n_jobs=2, pre_dispatch="all") yield check_same_results, dict(n_tasks=2, n_jobs=2, pre_dispatch="n_jobs") yield check_same_results, dict(n_tasks=10, n_jobs=2, pre_dispatch="n_jobs") yield check_same_results, dict(n_tasks=517, n_jobs=2, pre_dispatch="n_jobs") yield check_same_results, dict(n_tasks=10, n_jobs=2, pre_dispatch="n_jobs") yield check_same_results, dict(n_tasks=10, n_jobs=4, pre_dispatch="n_jobs") yield check_same_results, dict(n_tasks=25, n_jobs=4, batch_size=1) yield check_same_results, dict(n_tasks=25, n_jobs=4, batch_size=1, pre_dispatch="all") yield check_same_results, dict(n_tasks=25, n_jobs=4, batch_size=7) yield check_same_results, dict(n_tasks=10, n_jobs=4, pre_dispatch="2*n_jobs") def test_default_mp_context(): p = Parallel(n_jobs=2, backend='multiprocessing') if sys.version_info >= (3, 4): # Under Python 3.4+ the multiprocessing context can be configured # by an environment variable env_method = os.environ.get('JOBLIB_START_METHOD', '').strip() or None if env_method is None: # Check the default behavior if sys.platform == 'win32': assert_equal(p._mp_context.get_start_method(), 'spawn') else: assert_equal(p._mp_context.get_start_method(), 'fork') else: assert_equal(p._mp_context.get_start_method(), env_method) else: assert_equal(p._mp_context, None) @with_numpy def test_no_blas_crash_or_freeze_with_multiprocessing(): if sys.version_info < (3, 4): raise nose.SkipTest('multiprocessing can cause BLAS freeze on' ' old Python') # Use the spawn backend that is both robust and available on all platforms spawn_backend = mp.get_context('spawn') # Check that on recent Python version, the 'spawn' start method can make # it possible to use multiprocessing in conjunction of any BLAS # implementation that happens to be used by numpy with causing a freeze or # a crash rng = np.random.RandomState(42) # call BLAS DGEMM to force the initialization of the internal thread-pool # in the main process a = rng.randn(1000, 1000) np.dot(a, a.T) # check that the internal BLAS thread-pool is not in an inconsistent state # in the worker processes managed by 
multiprocessing Parallel(n_jobs=2, backend=spawn_backend)( delayed(np.dot)(a, a.T) for i in range(2)) def test_parallel_with_interactively_defined_functions(): # When functions are defined interactively in a python/IPython # session, we want to be able to use them with joblib.Parallel try: import posix except ImportError: # This test pass only when fork is the process start method raise nose.SkipTest('Not a POSIX platform') code = '\n\n'.join([ 'from joblib import Parallel, delayed', 'def sqrt(x): return x**2', 'print(Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(5)))']) check_subprocess_call([sys.executable, '-c', code], stdout_regex=r'\[0, 1, 4, 9, 16\]') def test_parallel_with_exhausted_iterator(): exhausted_iterator = iter([]) assert_equal(Parallel(n_jobs=2)(exhausted_iterator), []) def check_memmap(a): if not isinstance(a, np.memmap): raise TypeError('Expected np.memmap instance, got %r', type(a)) return a.copy() # return a regular array instead of a memmap @with_numpy @with_multiprocessing def test_auto_memmap_on_arrays_from_generator(): # Non-regression test for a problem with a bad interaction between the # GC collecting arrays recently created during iteration inside the # parallel dispatch loop and the auto-memmap feature of Parallel. # See: https://github.com/joblib/joblib/pull/294 def generate_arrays(n): for i in range(n): yield np.ones(10, dtype=np.float32) * i # Use max_nbytes=1 to force the use of memory-mapping even for small # arrays results = Parallel(n_jobs=2, max_nbytes=1)( delayed(check_memmap)(a) for a in generate_arrays(100)) for result, expected in zip(results, generate_arrays(len(results))): np.testing.assert_array_equal(expected, result) joblib-0.9.4/joblib/test/test_pool.py000066400000000000000000000370161264716474700176200ustar00rootroot00000000000000import os import shutil import tempfile from nose.tools import with_setup from nose.tools import assert_equal from nose.tools import assert_raises from nose.tools import assert_false from nose.tools import assert_true from joblib.test.common import with_numpy, np from joblib.test.common import setup_autokill from joblib.test.common import teardown_autokill from joblib.test.common import with_multiprocessing from joblib.test.common import with_dev_shm from joblib._multiprocessing_helpers import mp if mp is not None: from joblib.pool import MemmapingPool from joblib.pool import has_shareable_memory from joblib.pool import ArrayMemmapReducer from joblib.pool import reduce_memmap TEMP_FOLDER = None def setup_module(): setup_autokill(__name__, timeout=300) def teardown_module(): teardown_autokill(__name__) def setup_temp_folder(): global TEMP_FOLDER TEMP_FOLDER = tempfile.mkdtemp(prefix='joblib_test_pool_') def teardown_temp_folder(): global TEMP_FOLDER if TEMP_FOLDER is not None: shutil.rmtree(TEMP_FOLDER) TEMP_FOLDER = None with_temp_folder = with_setup(setup_temp_folder, teardown_temp_folder) def check_array(args): """Dummy helper function to be executed in subprocesses Check that the provided array has the expected values in the provided range. """ data, position, expected = args np.testing.assert_array_equal(data[position], expected) def inplace_double(args): """Dummy helper function to be executed in subprocesses Check that the input array has the right values in the provided range and perform an inplace modification to double the values in the range by two. 
""" data, position, expected = args assert_equal(data[position], expected) data[position] *= 2 np.testing.assert_array_equal(data[position], 2 * expected) @with_numpy @with_multiprocessing @with_temp_folder def test_memmap_based_array_reducing(): """Check that it is possible to reduce a memmap backed array""" assert_array_equal = np.testing.assert_array_equal filename = os.path.join(TEMP_FOLDER, 'test.mmap') # Create a file larger than what will be used by a buffer = np.memmap(filename, dtype=np.float64, shape=500, mode='w+') # Fill the original buffer with negative markers to detect over of # underflow in case of test failures buffer[:] = - 1.0 * np.arange(buffer.shape[0], dtype=buffer.dtype) buffer.flush() # Memmap a 2D fortran array on a offseted subsection of the previous # buffer a = np.memmap(filename, dtype=np.float64, shape=(3, 5, 4), mode='r+', order='F', offset=4) a[:] = np.arange(60).reshape(a.shape) # Build various views that share the buffer with the original memmap # b is an memmap sliced view on an memmap instance b = a[1:-1, 2:-1, 2:4] # c and d are array views c = np.asarray(b) d = c.T # Array reducer with auto dumping disabled reducer = ArrayMemmapReducer(None, TEMP_FOLDER, 'c') def reconstruct_array(x): cons, args = reducer(x) return cons(*args) def reconstruct_memmap(x): cons, args = reduce_memmap(x) return cons(*args) # Reconstruct original memmap a_reconstructed = reconstruct_memmap(a) assert_true(has_shareable_memory(a_reconstructed)) assert_true(isinstance(a_reconstructed, np.memmap)) assert_array_equal(a_reconstructed, a) # Reconstruct strided memmap view b_reconstructed = reconstruct_memmap(b) assert_true(has_shareable_memory(b_reconstructed)) assert_array_equal(b_reconstructed, b) # Reconstruct arrays views on memmap base c_reconstructed = reconstruct_array(c) assert_false(isinstance(c_reconstructed, np.memmap)) assert_true(has_shareable_memory(c_reconstructed)) assert_array_equal(c_reconstructed, c) d_reconstructed = reconstruct_array(d) assert_false(isinstance(d_reconstructed, np.memmap)) assert_true(has_shareable_memory(d_reconstructed)) assert_array_equal(d_reconstructed, d) # Test graceful degradation on fake memmap instances with in-memory # buffers a3 = a * 3 assert_false(has_shareable_memory(a3)) a3_reconstructed = reconstruct_memmap(a3) assert_false(has_shareable_memory(a3_reconstructed)) assert_false(isinstance(a3_reconstructed, np.memmap)) assert_array_equal(a3_reconstructed, a * 3) # Test graceful degradation on arrays derived from fake memmap instances b3 = np.asarray(a3) assert_false(has_shareable_memory(b3)) b3_reconstructed = reconstruct_array(b3) assert_true(isinstance(b3_reconstructed, np.ndarray)) assert_false(has_shareable_memory(b3_reconstructed)) assert_array_equal(b3_reconstructed, b3) @with_numpy @with_multiprocessing @with_temp_folder def test_high_dimension_memmap_array_reducing(): assert_array_equal = np.testing.assert_array_equal filename = os.path.join(TEMP_FOLDER, 'test.mmap') # Create a high dimensional memmap a = np.memmap(filename, dtype=np.float64, shape=(100, 15, 15, 3), mode='w+') a[:] = np.arange(100 * 15 * 15 * 3).reshape(a.shape) # Create some slices/indices at various dimensions b = a[0:10] c = a[:, 5:10] d = a[:, :, :, 0] e = a[1:3:4] def reconstruct_memmap(x): cons, args = reduce_memmap(x) res = cons(*args) return res a_reconstructed = reconstruct_memmap(a) assert_true(has_shareable_memory(a_reconstructed)) assert_true(isinstance(a_reconstructed, np.memmap)) assert_array_equal(a_reconstructed, a) b_reconstructed = 
reconstruct_memmap(b) assert_true(has_shareable_memory(b_reconstructed)) assert_array_equal(b_reconstructed, b) c_reconstructed = reconstruct_memmap(c) assert_true(has_shareable_memory(c_reconstructed)) assert_array_equal(c_reconstructed, c) d_reconstructed = reconstruct_memmap(d) assert_true(has_shareable_memory(d_reconstructed)) assert_array_equal(d_reconstructed, d) e_reconstructed = reconstruct_memmap(e) assert_true(has_shareable_memory(e_reconstructed)) assert_array_equal(e_reconstructed, e) @with_numpy @with_multiprocessing @with_temp_folder def test_pool_with_memmap(): """Check that subprocess can access and update shared memory memmap""" assert_array_equal = np.testing.assert_array_equal # Fork the subprocess before allocating the objects to be passed pool_temp_folder = os.path.join(TEMP_FOLDER, 'pool') os.makedirs(pool_temp_folder) p = MemmapingPool(10, max_nbytes=2, temp_folder=pool_temp_folder) try: filename = os.path.join(TEMP_FOLDER, 'test.mmap') a = np.memmap(filename, dtype=np.float32, shape=(3, 5), mode='w+') a.fill(1.0) p.map(inplace_double, [(a, (i, j), 1.0) for i in range(a.shape[0]) for j in range(a.shape[1])]) assert_array_equal(a, 2 * np.ones(a.shape)) # Open a copy-on-write view on the previous data b = np.memmap(filename, dtype=np.float32, shape=(5, 3), mode='c') p.map(inplace_double, [(b, (i, j), 2.0) for i in range(b.shape[0]) for j in range(b.shape[1])]) # Passing memmap instances to the pool should not trigger the creation # of new files on the FS assert_equal(os.listdir(pool_temp_folder), []) # the original data is untouched assert_array_equal(a, 2 * np.ones(a.shape)) assert_array_equal(b, 2 * np.ones(b.shape)) # readonly maps can be read but not updated c = np.memmap(filename, dtype=np.float32, shape=(10,), mode='r', offset=5 * 4) assert_raises(AssertionError, p.map, check_array, [(c, i, 3.0) for i in range(c.shape[0])]) # depending on the version of numpy one can either get a RuntimeError # or a ValueError assert_raises((RuntimeError, ValueError), p.map, inplace_double, [(c, i, 2.0) for i in range(c.shape[0])]) finally: # Clean all filehandlers held by the pool p.terminate() del p @with_numpy @with_multiprocessing @with_temp_folder def test_pool_with_memmap_array_view(): """Check that subprocess can access and update shared memory array""" assert_array_equal = np.testing.assert_array_equal # Fork the subprocess before allocating the objects to be passed pool_temp_folder = os.path.join(TEMP_FOLDER, 'pool') os.makedirs(pool_temp_folder) p = MemmapingPool(10, max_nbytes=2, temp_folder=pool_temp_folder) try: filename = os.path.join(TEMP_FOLDER, 'test.mmap') a = np.memmap(filename, dtype=np.float32, shape=(3, 5), mode='w+') a.fill(1.0) # Create an ndarray view on the memmap instance a_view = np.asarray(a) assert_false(isinstance(a_view, np.memmap)) assert_true(has_shareable_memory(a_view)) p.map(inplace_double, [(a_view, (i, j), 1.0) for i in range(a.shape[0]) for j in range(a.shape[1])]) # Both a and the a_view have been updated assert_array_equal(a, 2 * np.ones(a.shape)) assert_array_equal(a_view, 2 * np.ones(a.shape)) # Passing memmap array view to the pool should not trigger the # creation of new files on the FS assert_equal(os.listdir(pool_temp_folder), []) finally: p.terminate() del p @with_numpy @with_multiprocessing @with_temp_folder def test_memmaping_pool_for_large_arrays(): """Check that large arrays are not copied in memory""" # Check that the tempfolder is empty assert_equal(os.listdir(TEMP_FOLDER), []) # Build an array reducers that automaticaly dump 
large array content # to filesystem backed memmap instances to avoid memory explosion p = MemmapingPool(3, max_nbytes=40, temp_folder=TEMP_FOLDER) try: # The tempory folder for the pool is not provisioned in advance assert_equal(os.listdir(TEMP_FOLDER), []) assert_false(os.path.exists(p._temp_folder)) small = np.ones(5, dtype=np.float32) assert_equal(small.nbytes, 20) p.map(check_array, [(small, i, 1.0) for i in range(small.shape[0])]) # Memory has been copied, the pool filesystem folder is unused assert_equal(os.listdir(TEMP_FOLDER), []) # Try with a file larger than the memmap threshold of 40 bytes large = np.ones(100, dtype=np.float64) assert_equal(large.nbytes, 800) p.map(check_array, [(large, i, 1.0) for i in range(large.shape[0])]) # The data has been dumped in a temp folder for subprocess to share it # without per-child memory copies assert_true(os.path.isdir(p._temp_folder)) dumped_filenames = os.listdir(p._temp_folder) assert_equal(len(dumped_filenames), 2) # Check that memmory mapping is not triggered for arrays with # dtype='object' objects = np.array(['abc'] * 100, dtype='object') results = p.map(has_shareable_memory, [objects]) assert_false(results[0]) finally: # check FS garbage upon pool termination p.terminate() assert_false(os.path.exists(p._temp_folder)) del p @with_numpy @with_multiprocessing @with_temp_folder def test_memmaping_pool_for_large_arrays_disabled(): """Check that large arrays memmaping can be disabled""" # Set max_nbytes to None to disable the auto memmaping feature p = MemmapingPool(3, max_nbytes=None, temp_folder=TEMP_FOLDER) try: # Check that the tempfolder is empty assert_equal(os.listdir(TEMP_FOLDER), []) # Try with a file largish than the memmap threshold of 40 bytes large = np.ones(100, dtype=np.float64) assert_equal(large.nbytes, 800) p.map(check_array, [(large, i, 1.0) for i in range(large.shape[0])]) # Check that the tempfolder is still empty assert_equal(os.listdir(TEMP_FOLDER), []) finally: # Cleanup open file descriptors p.terminate() del p @with_numpy @with_multiprocessing @with_dev_shm def test_memmaping_on_dev_shm(): """Check that MemmapingPool uses /dev/shm when possible""" p = MemmapingPool(3, max_nbytes=10) try: # Check that the pool has correctly detected the presence of the # shared memory filesystem. pool_temp_folder = p._temp_folder folder_prefix = '/dev/shm/joblib_memmaping_pool_' assert_true(pool_temp_folder.startswith(folder_prefix)) assert_true(os.path.exists(pool_temp_folder)) # Try with a file larger than the memmap threshold of 10 bytes a = np.ones(100, dtype=np.float64) assert_equal(a.nbytes, 800) p.map(id, [a] * 10) # a should have been memmaped to the pool temp folder: the joblib # pickling procedure generate a .pkl and a .npy file: assert_equal(len(os.listdir(pool_temp_folder)), 2) # create a new array with content that is different from 'a' so that # it is mapped to a different file in the temporary folder of the # pool. 
b = np.ones(100, dtype=np.float64) * 2 assert_equal(b.nbytes, 800) p.map(id, [b] * 10) # A copy of both a and b are now stored in the shared memory folder assert_equal(len(os.listdir(pool_temp_folder)), 4) finally: # Cleanup open file descriptors p.terminate() del p # The temp folder is cleaned up upon pool termination assert_false(os.path.exists(pool_temp_folder)) @with_numpy @with_multiprocessing @with_temp_folder def test_memmaping_pool_for_large_arrays_in_return(): """Check that large arrays are not copied in memory in return""" assert_array_equal = np.testing.assert_array_equal # Build an array reducers that automaticaly dump large array content # but check that the returned datastructure are regular arrays to avoid # passing a memmap array pointing to a pool controlled temp folder that # might be confusing to the user # The MemmapingPool user can always return numpy.memmap object explicitly # to avoid memory copy p = MemmapingPool(3, max_nbytes=10, temp_folder=TEMP_FOLDER) try: res = p.apply_async(np.ones, args=(1000,)) large = res.get() assert_false(has_shareable_memory(large)) assert_array_equal(large, np.ones(1000)) finally: p.terminate() del p def _worker_multiply(a, n_times): """Multiplication function to be executed by subprocess""" assert_true(has_shareable_memory(a)) return a * n_times @with_numpy @with_multiprocessing @with_temp_folder def test_workaround_against_bad_memmap_with_copied_buffers(): """Check that memmaps with a bad buffer are returned as regular arrays Unary operations and ufuncs on memmap instances return a new memmap instance with an in-memory buffer (probably a numpy bug). """ assert_array_equal = np.testing.assert_array_equal p = MemmapingPool(3, max_nbytes=10, temp_folder=TEMP_FOLDER) try: # Send a complex, large-ish view on a array that will be converted to # a memmap in the worker process a = np.asarray(np.arange(6000).reshape((1000, 2, 3)), order='F')[:, :1, :] # Call a non-inplace multiply operation on the worker and memmap and # send it back to the parent. 
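        # (Added note, hedged) _worker_multiply above asserts that its
        # argument arrives as shared memory and returns a * n_times; since
        # ufuncs on memmaps can return memmap instances backed by plain
        # in-memory buffers, the result is expected to come back as a
        # regular array, which is what the assertions below check.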
b = p.apply_async(_worker_multiply, args=(a, 3)).get() assert_false(has_shareable_memory(b)) assert_array_equal(b, 3 * a) finally: p.terminate() del p joblib-0.9.4/joblib/test/test_testing.py000066400000000000000000000051471264716474700203240ustar00rootroot00000000000000import sys import re from nose.tools import assert_raises from joblib.testing import assert_raises_regex, check_subprocess_call def test_check_subprocess_call(): code = '\n'.join(['result = 1 + 2 * 3', 'print(result)', 'my_list = [1, 2, 3]', 'print(my_list)']) check_subprocess_call([sys.executable, '-c', code]) # Now checking stdout with a regex check_subprocess_call([sys.executable, '-c', code], # Regex needed for platform-specific line endings stdout_regex=r'7\s{1,2}\[1, 2, 3\]') def test_check_subprocess_call_non_matching_regex(): code = '42' non_matching_pattern = '_no_way_this_matches_anything_' assert_raises_regex(ValueError, 'Unexpected output.+{0}'.format(non_matching_pattern), check_subprocess_call, [sys.executable, '-c', code], stdout_regex=non_matching_pattern) def test_check_subprocess_call_wrong_command(): wrong_command = '_a_command_that_does_not_exist_' assert_raises(OSError, check_subprocess_call, [wrong_command]) def test_check_subprocess_call_non_zero_return_code(): code_with_non_zero_exit = '\n'.join([ 'import sys', 'print("writing on stdout")', 'sys.stderr.write("writing on stderr")', 'sys.exit(123)']) pattern = re.compile('Non-zero return code: 123.+' 'Stdout:\nwriting on stdout.+' 'Stderr:\nwriting on stderr', re.DOTALL) assert_raises_regex(ValueError, pattern, check_subprocess_call, [sys.executable, '-c', code_with_non_zero_exit]) def test_check_subprocess_call_timeout(): code_timing_out = '\n'.join([ 'import time', 'import sys', 'print("before sleep on stdout")', 'sys.stdout.flush()', 'sys.stderr.write("before sleep on stderr")', 'sys.stderr.flush()', 'time.sleep(1.1)', 'print("process should have be killed before")', 'sys.stdout.flush()']) pattern = re.compile('Non-zero return code:.+' 'Stdout:\nbefore sleep on stdout\s+' 'Stderr:\nbefore sleep on stderr', re.DOTALL) assert_raises_regex(ValueError, pattern, check_subprocess_call, [sys.executable, '-c', code_timing_out], timeout=1) joblib-0.9.4/joblib/testing.py000066400000000000000000000052161264716474700163030ustar00rootroot00000000000000""" Helper for testing. """ import sys import warnings import os.path import re import subprocess import threading from joblib._compat import PY3_OR_LATER def warnings_to_stdout(): """ Redirect all warnings to stdout. """ showwarning_orig = warnings.showwarning def showwarning(msg, cat, fname, lno, file=None, line=0): showwarning_orig(msg, cat, os.path.basename(fname), line, sys.stdout) warnings.showwarning = showwarning #warnings.simplefilter('always') try: from nose.tools import assert_raises_regex except ImportError: # For Python 2.7 try: from nose.tools import assert_raises_regexp as assert_raises_regex except ImportError: # for Python 2.6 def assert_raises_regex(expected_exception, expected_regexp, callable_obj=None, *args, **kwargs): """Helper function to check for message patterns in exceptions""" not_raised = False try: callable_obj(*args, **kwargs) not_raised = True except Exception as e: error_message = str(e) if not re.compile(expected_regexp).search(error_message): raise AssertionError("Error message should match pattern " "%r. %r does not." 
% (expected_regexp, error_message)) if not_raised: raise AssertionError("Should have raised %r" % expected_exception(expected_regexp)) def check_subprocess_call(cmd, timeout=1, stdout_regex=None): """Runs a command in a subprocess with timeout in seconds. Also checks returncode is zero and stdout if stdout_regex is set. """ proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE) def kill_process(): proc.kill() timer = threading.Timer(timeout, kill_process) try: timer.start() stdout, stderr = proc.communicate() if PY3_OR_LATER: stdout, stderr = stdout.decode(), stderr.decode() if proc.returncode != 0: message = ( 'Non-zero return code: {0}.\nStdout:\n{1}\n' 'Stderr:\n{2}').format( proc.returncode, stdout, stderr) raise ValueError(message) if (stdout_regex is not None and not re.search(stdout_regex, stdout)): raise ValueError( "Unexpected output: '{0!r}' does not match:\n{1!r}".format( stdout_regex, stdout)) finally: timer.cancel() joblib-0.9.4/setup.cfg000066400000000000000000000010721264716474700146300ustar00rootroot00000000000000[aliases] release = egg_info -RDb '' # Make sure the sphinx docs are built each time we do a dist. bdist = build_sphinx bdist sdist = build_sphinx sdist # Make sure a zip file is created each time we build the sphinx docs build_sphinx = build_sphinx zip_help # Make sure the docs are uploaded when we do an upload upload = upload upload_help [bdist_rpm] doc-files = doc [nosetests] verbosity = 2 detailed-errors = 1 with-coverage = 1 cover-package = joblib #pdb = 1 #pdb-failures = 1 with-doctest=1 doctest-extension=rst doctest-fixtures=_fixture [wheel] universal=1 joblib-0.9.4/setup.py000077500000000000000000000046251264716474700145330ustar00rootroot00000000000000#!/usr/bin/env python from distutils.core import setup import sys import joblib # For some commands, use setuptools if len(set(('develop', 'sdist', 'release', 'bdist_egg', 'bdist_rpm', 'bdist', 'bdist_dumb', 'bdist_wininst', 'install_egg_info', 'build_sphinx', 'egg_info', 'easy_install', 'upload', )).intersection(sys.argv)) > 0: from setupegg import extra_setuptools_args # extra_setuptools_args is injected by the setupegg.py script, for # running the setup with setuptools. if not 'extra_setuptools_args' in globals(): extra_setuptools_args = dict() # if nose available, provide test command try: from nose.commands import nosetests cmdclass = extra_setuptools_args.pop('cmdclass', {}) cmdclass['test'] = nosetests cmdclass['nosetests'] = nosetests extra_setuptools_args['cmdclass'] = cmdclass except ImportError: pass if __name__ == '__main__': # Protect the call to the setup function to prevent a fork-bomb # when running the tests with: # python setup.py nosetests setup(name='joblib', version=joblib.__version__, author='Gael Varoquaux', author_email='gael.varoquaux@normalesup.org', url='http://pythonhosted.org/joblib/', description=""" Lightweight pipelining: using Python functions as pipeline jobs. 
""", long_description=joblib.__doc__, license='BSD', classifiers=[ 'Development Status :: 5 - Production/Stable', 'Environment :: Console', 'Intended Audience :: Developers', 'Intended Audience :: Science/Research', 'Intended Audience :: Education', 'License :: OSI Approved :: BSD License', 'Operating System :: OS Independent', 'Programming Language :: Python :: 2.6', 'Programming Language :: Python :: 2.7', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.3', 'Programming Language :: Python :: 3.4', 'Topic :: Scientific/Engineering', 'Topic :: Utilities', 'Topic :: Software Development :: Libraries', ], platforms='any', package_data={'joblib.test': ['data/*.gz', 'data/*.pkl', 'data/*.npy']}, packages=['joblib', 'joblib.test', 'joblib.test.data'], **extra_setuptools_args) joblib-0.9.4/setupegg.py000077500000000000000000000051251264716474700152120ustar00rootroot00000000000000#!/usr/bin/env python """Wrapper to run setup.py using setuptools.""" import zipfile import os from setuptools import Command from sphinx_pypi_upload import UploadDoc ############################################################################### # Code to copy the sphinx-generated html docs in the distribution. DOC_BUILD_DIR = os.path.join('build', 'sphinx', 'html') def relative_path(filename): """ Return the relative path to the file, assuming the file is in the DOC_BUILD_DIR directory. """ length = len(os.path.abspath(DOC_BUILD_DIR)) + 1 return os.path.abspath(filename)[length:] class ZipHelp(Command): description = "zip the help created by the build_sphinx, " + \ "and put it in the source distribution. " user_options = [ ('None', None, 'this command has no options'), ] def run(self): if not os.path.exists(DOC_BUILD_DIR): raise OSError('Doc directory does not exist.') target_file = os.path.join('doc', 'documentation.zip') # ZIP_DEFLATED actually compresses the archive. However, there # will be a RuntimeError if zlib is not installed, so we check # for it. ZIP_STORED produces an uncompressed zip, but does not # require zlib. try: zf = zipfile.ZipFile(target_file, 'w', compression=zipfile.ZIP_DEFLATED) except RuntimeError: zf = zipfile.ZipFile(target_file, 'w', compression=zipfile.ZIP_STORED) for root, dirs, files in os.walk(DOC_BUILD_DIR): relative = relative_path(root) if not relative.startswith('.doctrees'): for f in files: zf.write(os.path.join(root, f), os.path.join(relative, f)) zf.close() def initialize_options(self): pass def finalize_options(self): pass ############################################################################### # Call the setup.py script, injecting the setuptools-specific arguments. extra_setuptools_args = dict( tests_require=['nose', 'coverage'], test_suite='nose.collector', cmdclass={'zip_help': ZipHelp, 'upload_help': UploadDoc}, zip_safe=False, ) if __name__ == '__main__': execfile('setup.py', dict(__name__='__main__', extra_setuptools_args=extra_setuptools_args)) joblib-0.9.4/sphinx_pypi_upload.py000066400000000000000000000115271264716474700173050ustar00rootroot00000000000000# -*- coding: utf-8 -*- """ sphinx_pypi_upload ~~~~~~~~~~~~~~~~~~ setuptools command for uploading Sphinx documentation to PyPI :author: Jannis Leidel :contact: jannis@leidel.info :copyright: Copyright 2009, Jannis Leidel. :license: BSD, see LICENSE for details. 
Modified for joblib by Gael Varoquaux """ import sys import os import socket try: import httplib import urlparse from cStringIO import StringIO as BytesIO except ImportError: # Python3k import http as httplib from urllib import parse as urlparse from io import BytesIO import base64 from distutils import log from distutils.command.upload import upload class UploadDoc(upload): """Distutils command to upload Sphinx documentation.""" description = 'Upload Sphinx documentation to PyPI' user_options = [ ('repository=', 'r', "url of repository [default: %s]" % upload.DEFAULT_REPOSITORY), ('show-response', None, 'display full response text from server'), ('upload-file=', None, 'file to upload'), ] boolean_options = upload.boolean_options def initialize_options(self): upload.initialize_options(self) self.upload_file = None def finalize_options(self): upload.finalize_options(self) if self.upload_file is None: self.upload_file = 'doc/documentation.zip' self.announce('Using upload file %s' % self.upload_file) def upload(self, filename): content = open(filename, 'rb').read() meta = self.distribution.metadata data = { ':action': 'doc_upload', 'name': meta.get_name(), 'content': (os.path.basename(filename), content), } # set up the authentication auth = "Basic " + base64.encodestring(self.username + ":" + \ self.password).strip() # Build up the MIME payload for the POST data boundary = '--------------GHSKFJDLGDS7543FJKLFHRE75642756743254' sep_boundary = '\n--' + boundary end_boundary = sep_boundary + '--' body = BytesIO() for key, value in data.items(): # handle multiple entries for the same name if type(value) != type([]): value = [value] for value in value: if type(value) is tuple: fn = ';filename="%s"' % value[0] value = value[1] else: fn = "" value = str(value) body.write(sep_boundary) body.write('\nContent-Disposition: form-data; name="%s"' % key) body.write(fn) body.write("\n\n") body.write(value) if value and value[-1] == '\r': body.write('\n') # write an extra newline (lurve Macs) body.write(end_boundary) body.write("\n") body = body.getvalue() self.announce("Submitting documentation to %s" % (self.repository), log.INFO) # build the Request # We can't use urllib2 since we need to send the Basic # auth right with the first request schema, netloc, url, params, query, fragments = \ urlparse.urlparse(self.repository) assert not params and not query and not fragments if schema == 'http': http = httplib.HTTPConnection(netloc) elif schema == 'https': http = httplib.HTTPSConnection(netloc) else: raise AssertionError("unsupported schema " + schema) data = '' loglevel = log.INFO try: http.connect() http.putrequest("POST", url) http.putheader('Content-type', 'multipart/form-data; boundary=%s' % boundary) http.putheader('Content-length', str(len(body))) http.putheader('Authorization', auth) http.endheaders() http.send(body) except socket.error as e: self.announce(str(e), log.ERROR) return response = http.getresponse() if response.status == 200: self.announce('Server response (%s): %s' % (response.status, response.reason), log.INFO) elif response.status == 301: location = response.getheader('Location') if location is None: location = 'http://pythonhosted.org/%s/' % meta.get_name() self.announce('Upload successful. Visit %s' % location, log.INFO) else: self.announce('Upload failed (%s): %s' % \ (response.status, response.reason), log.ERROR) if self.show_response: print('-' * 75 + response.read() + '-' * 75) def run(self): zip_file = self.upload_file self.upload(zip_file)
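# --- Added usage note (hedged; not part of the original module) ---
# As wired up in setupegg.py and setup.cfg above, the intended flow is to
# build the docs, zip them with the zip_help command, then push the
# archive with this command, roughly: python setup.py build_sphinx upload
# (the setup.cfg aliases expand this to build_sphinx, zip_help, upload
# and upload_help). The guarded snippet below only reports the default
# upload target and never contacts PyPI.
if __name__ == '__main__':
    default_zip = os.path.join('doc', 'documentation.zip')
    print('UploadDoc would submit %s to %s' % (default_zip,
                                               upload.DEFAULT_REPOSITORY))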