pax_global_header00006660000000000000000000000064131734435640014524gustar00rootroot0000000000000052 comment=0cad75bc2de6caaadbe78ccfa7902ea92ab1e950 scikit-learn-0.19.1/000077500000000000000000000000001317344356400142015ustar00rootroot00000000000000scikit-learn-0.19.1/.codecov.yml000066400000000000000000000012361317344356400164260ustar00rootroot00000000000000comment: off coverage: status: project: default: # Commits pushed to master should not make the overall # project coverage decrease by more than 1%: target: auto threshold: 1% patch: default: # Be tolerant on slight code coverage diff on PRs to limit # noisy red coverage status on github PRs. # Note The coverage stats are still uploaded # to codecov so that PR reviewers can see uncovered lines # in the github diff if they install the codecov browser # extension: # https://github.com/codecov/browser-extension target: auto threshold: 1% scikit-learn-0.19.1/.coveragerc000066400000000000000000000001761317344356400163260ustar00rootroot00000000000000[run] branch = True source = sklearn include = */sklearn/* omit = */sklearn/externals/* */benchmarks/* */setup.py scikit-learn-0.19.1/.gitattributes000066400000000000000000000000371317344356400170740ustar00rootroot00000000000000/doc/whats_new.rst merge=union scikit-learn-0.19.1/.gitignore000066400000000000000000000012661317344356400161760ustar00rootroot00000000000000*.pyc *.so *.pyd *~ .#* *.lprof *.swp *.swo .DS_Store build sklearn/datasets/__config__.py sklearn/**/*.html dist/ MANIFEST doc/_build/ doc/auto_examples/ doc/modules/generated/ doc/datasets/generated/ *.pdf pip-log.txt scikit_learn.egg-info/ .coverage coverage *.py,cover .tags* tags covtype.data.gz 20news-18828/ 20news-18828.tar.gz coverages.zip samples.zip doc/coverages.zip doc/samples.zip coverages samples doc/coverages doc/samples *.prof .tox/ .coverage lfw_preprocessed/ nips2010_pdf/ *.nt.bz2 *.tar.gz *.tgz examples/cluster/joblib reuters/ benchmarks/bench_covertype_data/ *.prefs .pydevproject .idea *.c *.cpp !*/src/*.c !*/src/*.cpp *.sln *.pyproj # Used by py.test .cache scikit-learn-0.19.1/.landscape.yml000066400000000000000000000001261317344356400167330ustar00rootroot00000000000000pylint: disable: - unpacking-non-sequence ignore-paths: - sklearn/externals scikit-learn-0.19.1/.mailmap000066400000000000000000000161371317344356400156320ustar00rootroot00000000000000Alexandre Gramfort Alexandre Gramfort Alexandre Gramfort Alexandre Saint Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Arnaud Joly Arnaud Joly Arnaud Joly Anne-Laure Fouque Ariel Rokem arokem Bala Subrahmanyam Varanasi Bertrand Thirion Brandyn A. 
White Brian Cheung Brian Cheung Brian Cheung Brian Holt Christian Osendorfer Clay Woolam Danny Sullivan Denis Engemann Denis Engemann Denis Engemann Denis Engemann dengemann Diego Molla DraXus draxus Edouard DUCHESNAY Edouard DUCHESNAY Edouard DUCHESNAY Emmanuelle Gouillart Emmanuelle Gouillart Eustache Diemert Fabian Pedregosa Fabian Pedregosa Fabian Pedregosa Federico Vaggi Federico Vaggi Gael Varoquaux Gael Varoquaux Gael Varoquaux Giorgio Patrini Giorgio Patrini Gilles Louppe Hamzeh Alsalhi <93hamsal@gmail.com> Harikrishnan S Hendrik Heuer Henry Lin Hrishikesh Huilgolkar Hugo Bowne-Anderson Imaculate Immanuel Bayer Jacob Schreiber Jacob Schreiber Jake VanderPlas Jake VanderPlas Jake VanderPlas James Bergstra Jaques Grobler Jan Schlüter Jean Kossaifi Jean Kossaifi Jean Kossaifi Joel Nothman Kyle Kastner Lars Buitinck Lars Buitinck Lars Buitinck Lars Buitinck Lars Buitinck Loic Esteve Manoj Kumar Matthieu Perrot Maheshakya Wijewardena Michael Bommarito Michael Eickenberg Michael Eickenberg Samuel Charron Sergio Medina Nelle Varoquaux Nelle Varoquaux Nelle Varoquaux Nicolas Goix Nicolas Pinto Noel Dawe Noel Dawe Olivier Grisel Olivier Grisel Olivier Hervieu Paul Butler Peter Prettenhofer Raghav RV Raghav RV Robert Layton Roman Sinayev Roman Sinayev Ronald Phlypo Satrajit Ghosh Sebastian Raschka Sebastian Raschka Shiqiao Du Shiqiao Du Thomas Unterthiner Tim Sheerman-Chase Vincent Dubourg Vincent Dubourg Vincent Michel Vincent Michel Vincent Michel Vincent Michel Vincent Michel Vincent Schut Virgile Fritsch Virgile Fritsch Vlad Niculae Wei Li Wei Li X006 Xinfan Meng Yannick Schwartz Yannick Schwartz Yannick Schwartz scikit-learn-0.19.1/.travis.yml000066400000000000000000000054631317344356400163220ustar00rootroot00000000000000# make it explicit that we favor the new container-based travis workers sudo: false language: python cache: apt: true directories: - $HOME/.cache/pip - $HOME/.ccache dist: trusty env: global: # Directory where tests are run from - TEST_DIR=/tmp/sklearn - OMP_NUM_THREADS=4 - OPENBLAS_NUM_THREADS=4 matrix: include: # This environment tests that scikit-learn can be built against # versions of numpy, scipy with ATLAS that comes with Ubuntu Trusty 14.04 - env: DISTRIB="ubuntu" PYTHON_VERSION="2.7" CYTHON_VERSION="0.23.4" COVERAGE=true addons: apt: packages: # these only required by the DISTRIB="ubuntu" builds: - python-scipy - libatlas3gf-base - libatlas-dev # This environment tests the oldest supported anaconda env - env: DISTRIB="conda" PYTHON_VERSION="2.7" INSTALL_MKL="false" NUMPY_VERSION="1.8.2" SCIPY_VERSION="0.13.3" CYTHON_VERSION="0.23.5" COVERAGE=true # This environment tests the newest supported Anaconda release (4.4.0) # It also runs tests requiring Pandas. - env: DISTRIB="conda" PYTHON_VERSION="3.6.1" INSTALL_MKL="true" NUMPY_VERSION="1.12.1" SCIPY_VERSION="0.19.0" PANDAS_VERSION="0.20.1" CYTHON_VERSION="0.25.2" COVERAGE=true # This environment use pytest to run the tests. It uses the newest # supported Anaconda release (4.4.0). It also runs tests requiring Pandas. 
- env: USE_PYTEST="true" DISTRIB="conda" PYTHON_VERSION="3.6.1" INSTALL_MKL="true" NUMPY_VERSION="1.12.1" SCIPY_VERSION="0.19.0" PANDAS_VERSION="0.20.1" CYTHON_VERSION="0.25.2" # flake8 linting on diff wrt common ancestor with upstream/master - env: RUN_FLAKE8="true" SKIP_TESTS="true" DISTRIB="conda" PYTHON_VERSION="3.5" INSTALL_MKL="true" NUMPY_VERSION="1.12.1" SCIPY_VERSION="0.19.0" CYTHON_VERSION="0.23.5" TEST_DOCSTRINGS="true" # This environment tests scikit-learn against numpy and scipy master # installed from their CI wheels in a virtualenv with the Python # interpreter provided by travis. - python: 3.5 env: DISTRIB="scipy-dev-wheels" allow_failures: # allow_failures seems to be keyed on the python version # We are using this to allow failures for DISTRIB=scipy-dev-wheels - python: 3.5 install: source build_tools/travis/install.sh script: bash build_tools/travis/test_script.sh after_success: source build_tools/travis/after_success.sh notifications: webhooks: urls: - https://webhooks.gitter.im/e/4ffabb4df010b70cd624 on_success: change # options: [always|never|change] default: always on_failure: always # options: [always|never|change] default: always on_start: never # options: [always|never|change] default: always scikit-learn-0.19.1/AUTHORS.rst000066400000000000000000000052171317344356400160650ustar00rootroot00000000000000.. -*- mode: rst -*- This is a community effort, and as such many people have contributed to it over the years. History ------- This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu Brucher started work on this project as part of his thesis. In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the project and made the first public release, February the 1st 2010. Since then, several releases have appeared following a ~3 month cycle, and a thriving international community has been leading the development. People ------ The following people have been core contributors to scikit-learn's development and maintenance: .. hlist:: * `Mathieu Blondel `_ * `Matthieu Brucher `_ * Lars Buitinck * David Cournapeau * `Noel Dawe `_ * Vincent Dubourg * Edouard Duchesnay * `Tom Dupré la Tour `_ * Alexander Fabisch * `Virgile Fritsch `_ * `Satra Ghosh `_ * `Angel Soler Gollonet `_ * Chris Filo Gorgolewski * `Alexandre Gramfort `_ * `Olivier Grisel `_ * `Jaques Grobler `_ * `Yaroslav Halchenko `_ * `Brian Holt `_ * `Arnaud Joly `_ * Thouis (Ray) Jones * `Kyle Kastner `_ * `Manoj Kumar `_ * Robert Layton * `Wei Li `_ * Paolo Losi * `Gilles Louppe `_ * `Jan Hendrik Metzen `_ * Vincent Michel * Jarrod Millman * `Andreas Müller `_ (release manager) * `Vlad Niculae `_ * `Joel Nothman `_ * `Alexandre Passos `_ * `Fabian Pedregosa `_ * `Peter Prettenhofer `_ * Bertrand Thirion * `Jake VanderPlas `_ * Nelle Varoquaux * `Gael Varoquaux `_ * Ron Weiss Please do not email the authors directly to ask for assistance or report issues. Instead, please see `What's the best way to ask questions about scikit-learn `_ in the FAQ. scikit-learn-0.19.1/CONTRIBUTING.md000066400000000000000000000242661317344356400164440ustar00rootroot00000000000000 Contributing to scikit-learn ============================ **Note: This document is a 'getting started' summary for contributing code, documentation, testing, and filing issues.** Visit the [**Contributing page**](http://scikit-learn.org/stable/developers/contributing.html) for the full contributor's guide. 
Please read it carefully to help make the code review process go as
smoothly as possible and maximize the likelihood of your contribution
being merged.

How to contribute
-----------------

The preferred workflow for contributing to scikit-learn is to fork the
[main repository](https://github.com/scikit-learn/scikit-learn) on
GitHub, clone, and develop on a branch. Steps:

1. Fork the [project repository](https://github.com/scikit-learn/scikit-learn)
   by clicking on the 'Fork' button near the top right of the page. This
   creates a copy of the code under your GitHub user account. For more
   details on how to fork a repository see
   [this guide](https://help.github.com/articles/fork-a-repo/).

2. Clone your fork of the scikit-learn repo from your GitHub account to your
   local disk:

   ```bash
   $ git clone git@github.com:YourLogin/scikit-learn.git
   $ cd scikit-learn
   ```

3. Create a ``feature`` branch to hold your development changes:

   ```bash
   $ git checkout -b my-feature
   ```

   Always use a ``feature`` branch. It's good practice to never work on the
   ``master`` branch!

4. Develop the feature on your feature branch. Add changed files using
   ``git add`` and then commit them with ``git commit``:

   ```bash
   $ git add modified_files
   $ git commit
   ```

   to record your changes in Git, then push the changes to your GitHub
   account with:

   ```bash
   $ git push -u origin my-feature
   ```

5. Follow [these instructions](https://help.github.com/articles/creating-a-pull-request-from-a-fork)
   to create a pull request from your fork. This will send an email to the
   committers.

(If any of the above seems like magic to you, please look up the
[Git documentation](https://git-scm.com/documentation) on the web, or ask a
friend or another contributor for help.)

Pull Request Checklist
----------------------

We recommend that your contribution complies with the following rules before
you submit a pull request:

-  Follow the
   [coding-guidelines](http://scikit-learn.org/dev/developers/contributing.html#coding-guidelines).

-  Use, when applicable, the validation tools and scripts in the
   `sklearn.utils` submodule. A list of utility routines available for
   developers can be found in the
   [Utilities for Developers](http://scikit-learn.org/dev/developers/utilities.html#developers-utils)
   page.

-  Give your pull request a helpful title that summarises what your
   contribution does. In some cases `Fix <ISSUE TITLE>` is enough.
   `Fix #<ISSUE NUMBER>` is not enough.

-  Often pull requests resolve one or more other issues (or pull requests).
   If merging your pull request means that some other issues/PRs should
   be closed, you should
   [use keywords to create a link to them](https://github.com/blog/1506-closing-issues-via-pull-requests/)
   (e.g., `Fixes #1234`; multiple issues/PRs are allowed as long as each
   one is preceded by a keyword). Upon merging, those issues/PRs will
   automatically be closed by GitHub. If your pull request is simply
   related to some other issues/PRs, create a link to them without using
   the keywords (e.g., `See also #1234`).

-  All public methods should have informative docstrings with sample
   usage presented as doctests when appropriate.

-  Please prefix the title of your pull request with `[MRG]` (Ready for
   Merge) if the contribution is complete and ready for a detailed review.
   Two core developers will review your code and change the prefix of the
   pull request to `[MRG + 1]` and `[MRG + 2]` on approval, making it
   eligible for merging.
   An incomplete contribution -- where you expect to do more work before
   receiving a full review -- should be prefixed `[WIP]` (to indicate a
   work in progress) and changed to `[MRG]` when it matures. WIPs may be
   useful to: indicate you are working on something to avoid duplicated
   work, request broad review of functionality or API, or seek
   collaborators. WIPs often benefit from the inclusion of a
   [task list](https://github.com/blog/1375-task-lists-in-gfm-issues-pulls-comments)
   in the PR description.

-  All other tests pass when everything is rebuilt from scratch. On
   Unix-like systems, check with (from the toplevel source folder):

   ```bash
   $ make
   ```

-  When adding additional functionality, provide at least one example script
   in the ``examples/`` folder. Have a look at other examples for reference.
   Examples should demonstrate why the new functionality is useful in
   practice and, if possible, compare it to other methods available in
   scikit-learn.

-  Documentation and high-coverage tests are necessary for enhancements to
   be accepted. Bug fixes and new features should be provided with
   [non-regression tests](https://en.wikipedia.org/wiki/Non-regression_testing).
   These tests verify the correct behavior of the fix or feature, so that
   further modifications of the code base are guaranteed to stay consistent
   with the desired behavior. For bug fixes, at the time of the PR these
   tests should fail on the master code base and pass with the PR applied.

-  At least one paragraph of narrative documentation with links to
   references in the literature (with PDF links when possible) and
   the example.

-  The documentation should also include expected time and space complexity
   of the algorithm and scalability, e.g. "this algorithm can scale to a
   large number of samples > 100000, but does not scale in dimensionality:
   n_features is expected to be lower than 100".

You can also check for common programming errors with the following tools:

-  Code with good unittest **coverage** (at least 80%), check with:

   ```bash
   $ pip install nose coverage
   $ nosetests --with-coverage path/to/tests_for_package
   ```

-  No pyflakes warnings, check with:

   ```bash
   $ pip install pyflakes
   $ pyflakes path/to/module.py
   ```

-  No PEP8 warnings, check with:

   ```bash
   $ pip install pep8
   $ pep8 path/to/module.py
   ```

-  AutoPEP8 can help you fix some of the easy, repetitive errors:

   ```bash
   $ pip install autopep8
   $ autopep8 path/to/pep8.py
   ```

Bonus points for contributions that include a performance analysis with a
benchmark script and profiling output (please report on the mailing list
or on the GitHub issue).

Filing bugs
-----------

We use GitHub issues to track all bugs and feature requests; feel free to
open an issue if you have found a bug or wish to see a feature implemented.

It is recommended to check that your issue complies with the following
rules before submitting:

-  Verify that your issue is not currently being addressed by other
   [issues](https://github.com/scikit-learn/scikit-learn/issues?q=)
   or [pull requests](https://github.com/scikit-learn/scikit-learn/pulls?q=).

-  If you are submitting an algorithm or feature request, please verify that
   the algorithm fulfills our
   [new algorithm requirements](http://scikit-learn.org/stable/faq.html#can-i-add-this-new-algorithm-that-i-or-someone-else-just-published).

-  Please ensure all code snippets and error messages are formatted in
   appropriate code blocks. See
   [Creating and highlighting code blocks](https://help.github.com/articles/creating-and-highlighting-code-blocks).
-  Please include your operating system type and version number, as well as
   your Python, scikit-learn, numpy, and scipy versions. This information
   can be found by running the following code snippet:

   ```python
   import platform; print(platform.platform())
   import sys; print("Python", sys.version)
   import numpy; print("NumPy", numpy.__version__)
   import scipy; print("SciPy", scipy.__version__)
   import sklearn; print("Scikit-Learn", sklearn.__version__)
   ```

-  Please be specific about what estimators and/or functions are involved
   and the shape of the data, as appropriate; please include a
   [reproducible](http://stackoverflow.com/help/mcve) code snippet
   or link to a [gist](https://gist.github.com). If an exception is raised,
   please provide the traceback.

New contributor tips
--------------------

A great way to start contributing to scikit-learn is to pick an item
from the list of
[good first issues](https://github.com/scikit-learn/scikit-learn/labels/good%20first%20issue).
If you have already contributed to scikit-learn, look at
[Easy issues](https://github.com/scikit-learn/scikit-learn/labels/Easy)
instead. Resolving these issues allows you to start contributing to the
project without much prior knowledge. Your assistance in this area will
be greatly appreciated by the more experienced developers as it helps
free up their time to concentrate on other issues.

Documentation
-------------

We are glad to accept any sort of documentation: function docstrings,
reStructuredText documents (like this one), tutorials, etc.
reStructuredText documents live in the source code repository under the
doc/ directory.

You can edit the documentation using any text editor and then generate
the HTML output by typing ``make html`` from the doc/ directory.
Alternatively, ``make`` can be used to quickly generate the
documentation without the example gallery. The resulting HTML files will
be placed in ``_build/html/`` and are viewable in a web browser. See the
``README`` file in the ``doc/`` directory for more information.

For building the documentation, you will need
[sphinx](http://sphinx.pocoo.org/),
[matplotlib](http://matplotlib.org/), and
[pillow](http://pillow.readthedocs.io/en/latest/).

When you are writing documentation, it is important to strike a good
balance between mathematical and algorithmic details, and to give the
reader an intuition for what the algorithm does. It is best to always
start with a small paragraph with a hand-waving explanation of what the
method does to the data and a figure (coming from an example)
illustrating it.

Further Information
-------------------

Visit the [Contributing Code](http://scikit-learn.org/stable/developers/index.html#coding-guidelines)
section of the website for more information including conforming to the
API spec and profiling contributed code.

scikit-learn-0.19.1/COPYING000066400000000000000000000030271317344356400152360ustar00rootroot00000000000000New BSD License

Copyright (c) 2007–2017 The scikit-learn developers.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

  a. Redistributions of source code must retain the above copyright notice,
     this list of conditions and the following disclaimer.
  b. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.
  c.
Neither the name of the Scikit-learn Developers nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. scikit-learn-0.19.1/ISSUE_TEMPLATE.md000066400000000000000000000033501317344356400167070ustar00rootroot00000000000000 #### Description #### Steps/Code to Reproduce #### Expected Results #### Actual Results #### Versions scikit-learn-0.19.1/MANIFEST.in000066400000000000000000000003631317344356400157410ustar00rootroot00000000000000include *.rst recursive-include doc * recursive-include examples * recursive-include sklearn *.c *.h *.pyx *.pxd *.pxi recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt include COPYING include AUTHORS.rst include README.rst scikit-learn-0.19.1/Makefile000066400000000000000000000027501317344356400156450ustar00rootroot00000000000000# simple makefile to simplify repetitive build env management tasks under posix # caution: testing won't work on windows, see README PYTHON ?= python CYTHON ?= cython NOSETESTS ?= nosetests CTAGS ?= ctags # skip doctests on 32bit python BITS := $(shell python -c 'import struct; print(8 * struct.calcsize("P"))') ifeq ($(BITS),32) NOSETESTS:=$(NOSETESTS) -c setup32.cfg endif all: clean inplace test clean-ctags: rm -f tags clean: clean-ctags $(PYTHON) setup.py clean rm -rf dist in: inplace # just a shortcut inplace: $(PYTHON) setup.py build_ext -i test-code: in $(NOSETESTS) -s -v sklearn test-sphinxext: $(NOSETESTS) -s -v doc/sphinxext/ test-doc: ifeq ($(BITS),64) $(NOSETESTS) -s -v doc/*.rst doc/modules/ doc/datasets/ \ doc/developers doc/tutorial/basic doc/tutorial/statistical_inference \ doc/tutorial/text_analytics endif test-coverage: rm -rf coverage .coverage $(NOSETESTS) -s -v --with-coverage sklearn test: test-code test-sphinxext test-doc trailing-spaces: find sklearn -name "*.py" -exec perl -pi -e 's/[ \t]*$$//' {} \; cython: python setup.py build_src ctags: # make tags for symbol based navigation in emacs and vim # Install with: sudo apt-get install exuberant-ctags $(CTAGS) --python-kinds=-i -R sklearn doc: inplace $(MAKE) -C doc html doc-noplot: inplace $(MAKE) -C doc html-noplot code-analysis: flake8 sklearn | grep -v __init__ | grep -v external pylint -E -i y sklearn/ -d E1103,E0611,E1101 flake8-diff: ./build_tools/travis/flake8_diff.sh scikit-learn-0.19.1/PULL_REQUEST_TEMPLATE.md000066400000000000000000000022561317344356400200070ustar00rootroot00000000000000 #### Reference Issues/PRs #### What does this implement/fix? Explain your changes. #### Any other comments? scikit-learn-0.19.1/README.rst000066400000000000000000000132421317344356400156720ustar00rootroot00000000000000.. -*- mode: rst -*- |Travis|_ |AppVeyor|_ |Codecov|_ |CircleCI|_ |Python27|_ |Python35|_ |PyPi|_ |DOI|_ .. 
|Travis| image:: https://api.travis-ci.org/scikit-learn/scikit-learn.svg?branch=master .. _Travis: https://travis-ci.org/scikit-learn/scikit-learn .. |AppVeyor| image:: https://ci.appveyor.com/api/projects/status/github/scikit-learn/scikit-learn?branch=master&svg=true .. _AppVeyor: https://ci.appveyor.com/project/sklearn-ci/scikit-learn/history .. |Codecov| image:: https://codecov.io/github/scikit-learn/scikit-learn/badge.svg?branch=master&service=github .. _Codecov: https://codecov.io/github/scikit-learn/scikit-learn?branch=master .. |CircleCI| image:: https://circleci.com/gh/scikit-learn/scikit-learn/tree/master.svg?style=shield&circle-token=:circle-token .. _CircleCI: https://circleci.com/gh/scikit-learn/scikit-learn .. |Python27| image:: https://img.shields.io/badge/python-2.7-blue.svg .. _Python27: https://badge.fury.io/py/scikit-learn .. |Python35| image:: https://img.shields.io/badge/python-3.5-blue.svg .. _Python35: https://badge.fury.io/py/scikit-learn .. |PyPi| image:: https://badge.fury.io/py/scikit-learn.svg .. _PyPi: https://badge.fury.io/py/scikit-learn .. |DOI| image:: https://zenodo.org/badge/21369/scikit-learn/scikit-learn.svg .. _DOI: https://zenodo.org/badge/latestdoi/21369/scikit-learn/scikit-learn scikit-learn ============ scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the `AUTHORS.rst `_ file for a complete list of contributors. It is currently maintained by a team of volunteers. Website: http://scikit-learn.org Installation ------------ Dependencies ~~~~~~~~~~~~ scikit-learn requires: - Python (>= 2.7 or >= 3.3) - NumPy (>= 1.8.2) - SciPy (>= 0.13.3) For running the examples Matplotlib >= 1.1.1 is required. scikit-learn also uses CBLAS, the C interface to the Basic Linear Algebra Subprograms library. scikit-learn comes with a reference implementation, but the system CBLAS will be detected by the build system and used if present. CBLAS exists in many implementations; see `Linear algebra libraries `_ for known issues. User installation ~~~~~~~~~~~~~~~~~ If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using ``pip`` :: pip install -U scikit-learn or ``conda``:: conda install scikit-learn The documentation includes more detailed `installation instructions `_. Development ----------- We welcome new contributors of all experience levels. The scikit-learn community goals are to be helpful, welcoming, and effective. The `Development Guide `_ has detailed information about contributing code, documentation, tests, and more. We've included some basic information in this README. 
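Before diving into development, it can help to confirm that your
environment picks up the expected scikit-learn build. A minimal sanity
check -- a sketch, not an official diagnostic script -- is to print the
installed version::

    import sklearn
    print(sklearn.__version__)  # expected: '0.19.1' for this release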
Important links ~~~~~~~~~~~~~~~ - Official source code repo: https://github.com/scikit-learn/scikit-learn - Download releases: https://pypi.python.org/pypi/scikit-learn - Issue tracker: https://github.com/scikit-learn/scikit-learn/issues Source code ~~~~~~~~~~~ You can check the latest sources with the command:: git clone https://github.com/scikit-learn/scikit-learn.git Setting up a development environment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Quick tutorial on how to go about setting up your environment to contribute to scikit-learn: https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md Testing ~~~~~~~ After installation, you can launch the test suite from outside the source directory (you will need to have the ``nose`` package installed):: nosetests -v sklearn Under Windows, it is recommended to use the following command (adjust the path to the ``python.exe`` program) as using the ``nosetests.exe`` program can badly interact with tests that use ``multiprocessing``:: C:\Python34\python.exe -c "import nose; nose.main()" -v sklearn See the web page http://scikit-learn.org/stable/developers/advanced_installation.html#testing for more information. Random number generation can be controlled during testing by setting the ``SKLEARN_SEED`` environment variable. Submitting a Pull Request ~~~~~~~~~~~~~~~~~~~~~~~~~ Before opening a Pull Request, have a look at the full Contributing page to make sure your code complies with our guidelines: http://scikit-learn.org/stable/developers/index.html Project History --------------- The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the `AUTHORS.rst `_ file for a complete list of contributors. The project is currently maintained by a team of volunteers. **Note**: `scikit-learn` was previously referred to as `scikits.learn`. 
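As noted in the Testing section above, random number generation during
testing can be controlled through the ``SKLEARN_SEED`` environment
variable. A minimal sketch of running the test suite with a fixed seed
from Python (the seed value 42 is an arbitrary choice)::

    import os
    os.environ["SKLEARN_SEED"] = "42"  # must be set before the tests start

    import nose
    nose.main(argv=["nosetests", "-v", "sklearn"])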
Help and Support ---------------- Documentation ~~~~~~~~~~~~~ - HTML documentation (stable release): http://scikit-learn.org - HTML documentation (development version): http://scikit-learn.org/dev/ - FAQ: http://scikit-learn.org/stable/faq.html Communication ~~~~~~~~~~~~~ - Mailing list: https://mail.python.org/mailman/listinfo/scikit-learn - IRC channel: ``#scikit-learn`` at ``webchat.freenode.net`` - Stack Overflow: http://stackoverflow.com/questions/tagged/scikit-learn - Website: http://scikit-learn.org Citation ~~~~~~~~ If you use scikit-learn in a scientific publication, we would appreciate citations: http://scikit-learn.org/stable/about.html#citing-scikit-learn scikit-learn-0.19.1/appveyor.yml000066400000000000000000000075141317344356400166000ustar00rootroot00000000000000# AppVeyor.com is a Continuous Integration service to build and run tests under # Windows # https://ci.appveyor.com/project/sklearn-ci/scikit-learn environment: global: # SDK v7.0 MSVC Express 2008's SetEnv.cmd script will fail if the # /E:ON and /V:ON options are not enabled in the batch script interpreter # See: http://stackoverflow.com/a/13751649/163740 CMD_IN_ENV: "cmd /E:ON /V:ON /C .\\build_tools\\appveyor\\run_with_env.cmd" WHEELHOUSE_UPLOADER_USERNAME: sklearn-appveyor WHEELHOUSE_UPLOADER_SECRET: secure: BQm8KfEj6v2Y+dQxb2syQvTFxDnHXvaNktkLcYSq7jfbTOO6eH9n09tfQzFUVcWZ # Make sure we don't download large datasets when running the test on # continuous integration platform SKLEARN_SKIP_NETWORK_TESTS: 1 matrix: - PYTHON: "C:\\Python27" PYTHON_VERSION: "2.7.8" PYTHON_ARCH: "32" - PYTHON: "C:\\Python27-x64" PYTHON_VERSION: "2.7.8" PYTHON_ARCH: "64" - PYTHON: "C:\\Python35" PYTHON_VERSION: "3.5.0" PYTHON_ARCH: "32" - PYTHON: "C:\\Python35-x64" PYTHON_VERSION: "3.5.0" PYTHON_ARCH: "64" install: # If there is a newer build queued for the same PR, cancel this one. # The AppVeyor 'rollout builds' option is supposed to serve the same # purpose but is problematic because it tends to cancel builds pushed # directly to master instead of just PR builds. # credits: JuliaLang developers. - ps: if ($env:APPVEYOR_PULL_REQUEST_NUMBER -and $env:APPVEYOR_BUILD_NUMBER -ne ((Invoke-RestMethod ` https://ci.appveyor.com/api/projects/$env:APPVEYOR_ACCOUNT_NAME/$env:APPVEYOR_PROJECT_SLUG/history?recordsNumber=50).builds | ` Where-Object pullRequestId -eq $env:APPVEYOR_PULL_REQUEST_NUMBER)[0].buildNumber) { ` throw "There are newer queued builds for this pull request, failing early." } # Install Python (from the official .msi of http://python.org) and pip when # not already installed. - "powershell ./build_tools/appveyor/install.ps1" - "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%" - "python -m pip install -U pip" # Check that we have the expected version and architecture for Python - "python --version" - "python -c \"import struct; print(struct.calcsize('P') * 8)\"" - "pip --version" # Install the build and runtime dependencies of the project. - "%CMD_IN_ENV% pip install --timeout=60 --trusted-host 28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com -r build_tools/appveyor/requirements.txt" - "%CMD_IN_ENV% python setup.py bdist_wheel bdist_wininst -b doc/logos/scikit-learn-logo.bmp" - ps: "ls dist" # Install the generated wheel package to test it - "pip install --pre --no-index --find-links dist/ scikit-learn" # Not a .NET project, we build scikit-learn in the install step instead build: false test_script: # Change to a non-source folder to make sure we run the tests on the # installed library. 
- "mkdir empty_folder" - "cd empty_folder" - "python -c \"import nose; nose.main()\" --with-timer --timer-top-n 20 -s sklearn" # Move back to the project folder - "cd .." artifacts: # Archive the generated wheel package in the ci.appveyor.com build report. - path: dist\* on_success: # Upload the generated wheel package to Rackspace # On Windows, Apache Libcloud cannot find a standard CA cert bundle so we # disable the ssl checks. - "python -m wheelhouse_uploader upload --no-ssl-check --local-folder=dist sklearn-windows-wheels" notifications: - provider: Webhook url: https://webhooks.gitter.im/e/0dc8e57cd38105aeb1b4 on_build_success: false on_build_failure: True cache: # Use the appveyor cache to avoid re-downloading large archives such # the MKL numpy and scipy wheels mirrored on a rackspace cloud # container, speed up the appveyor jobs and reduce bandwidth # usage on our rackspace account. - '%APPDATA%\pip\Cache' scikit-learn-0.19.1/benchmarks/000077500000000000000000000000001317344356400163165ustar00rootroot00000000000000scikit-learn-0.19.1/benchmarks/.gitignore000066400000000000000000000000511317344356400203020ustar00rootroot00000000000000/bhtsne *.npy *.json /mnist_tsne_output/ scikit-learn-0.19.1/benchmarks/bench_20newsgroups.py000066400000000000000000000067431317344356400224170ustar00rootroot00000000000000from __future__ import print_function, division from time import time import argparse import numpy as np from sklearn.dummy import DummyClassifier from sklearn.datasets import fetch_20newsgroups_vectorized from sklearn.metrics import accuracy_score from sklearn.utils.validation import check_array from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import ExtraTreesClassifier from sklearn.ensemble import AdaBoostClassifier from sklearn.linear_model import LogisticRegression from sklearn.naive_bayes import MultinomialNB ESTIMATORS = { "dummy": DummyClassifier(), "random_forest": RandomForestClassifier(n_estimators=100, max_features="sqrt", min_samples_split=10), "extra_trees": ExtraTreesClassifier(n_estimators=100, max_features="sqrt", min_samples_split=10), "logistic_regression": LogisticRegression(), "naive_bayes": MultinomialNB(), "adaboost": AdaBoostClassifier(n_estimators=10), } ############################################################################### # Data if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('-e', '--estimators', nargs="+", required=True, choices=ESTIMATORS) args = vars(parser.parse_args()) data_train = fetch_20newsgroups_vectorized(subset="train") data_test = fetch_20newsgroups_vectorized(subset="test") X_train = check_array(data_train.data, dtype=np.float32, accept_sparse="csc") X_test = check_array(data_test.data, dtype=np.float32, accept_sparse="csr") y_train = data_train.target y_test = data_test.target print("20 newsgroups") print("=============") print("X_train.shape = {0}".format(X_train.shape)) print("X_train.format = {0}".format(X_train.format)) print("X_train.dtype = {0}".format(X_train.dtype)) print("X_train density = {0}" "".format(X_train.nnz / np.product(X_train.shape))) print("y_train {0}".format(y_train.shape)) print("X_test {0}".format(X_test.shape)) print("X_test.format = {0}".format(X_test.format)) print("X_test.dtype = {0}".format(X_test.dtype)) print("y_test {0}".format(y_test.shape)) print() print("Classifier Training") print("===================") accuracy, train_time, test_time = {}, {}, {} for name in sorted(args["estimators"]): clf = ESTIMATORS[name] try: 
clf.set_params(random_state=0) except (TypeError, ValueError): pass print("Training %s ... " % name, end="") t0 = time() clf.fit(X_train, y_train) train_time[name] = time() - t0 t0 = time() y_pred = clf.predict(X_test) test_time[name] = time() - t0 accuracy[name] = accuracy_score(y_test, y_pred) print("done") print() print("Classification performance:") print("===========================") print() print("%s %s %s %s" % ("Classifier ", "train-time", "test-time", "Accuracy")) print("-" * 44) for name in sorted(accuracy, key=accuracy.get): print("%s %s %s %s" % (name.ljust(16), ("%.4fs" % train_time[name]).center(10), ("%.4fs" % test_time[name]).center(10), ("%.4f" % accuracy[name]).center(10))) print() scikit-learn-0.19.1/benchmarks/bench_covertype.py000066400000000000000000000163411317344356400220540ustar00rootroot00000000000000""" =========================== Covertype dataset benchmark =========================== Benchmark stochastic gradient descent (SGD), Liblinear, and Naive Bayes, CART (decision tree), RandomForest and Extra-Trees on the forest covertype dataset of Blackard, Jock, and Dean [1]. The dataset comprises 581,012 samples. It is low dimensional with 54 features and a sparsity of approx. 23%. Here, we consider the task of predicting class 1 (spruce/fir). The classification performance of SGD is competitive with Liblinear while being two orders of magnitude faster to train:: [..] Classification performance: =========================== Classifier train-time test-time error-rate -------------------------------------------- liblinear 15.9744s 0.0705s 0.2305 GaussianNB 3.0666s 0.3884s 0.4841 SGD 1.0558s 0.1152s 0.2300 CART 79.4296s 0.0523s 0.0469 RandomForest 1190.1620s 0.5881s 0.0243 ExtraTrees 640.3194s 0.6495s 0.0198 The same task has been used in a number of papers including: * `"SVM Optimization: Inverse Dependence on Training Set Size" `_ S. Shalev-Shwartz, N. Srebro - In Proceedings of ICML '08. * `"Pegasos: Primal estimated sub-gradient solver for svm" `_ S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML '07. * `"Training Linear SVMs in Linear Time" `_ T. 
Joachims - In SIGKDD '06 [1] http://archive.ics.uci.edu/ml/datasets/Covertype """ from __future__ import division, print_function # Author: Peter Prettenhofer # Arnaud Joly # License: BSD 3 clause import os from time import time import argparse import numpy as np from sklearn.datasets import fetch_covtype, get_data_home from sklearn.svm import LinearSVC from sklearn.linear_model import SGDClassifier, LogisticRegression from sklearn.naive_bayes import GaussianNB from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.metrics import zero_one_loss from sklearn.externals.joblib import Memory from sklearn.utils import check_array # Memoize the data extraction and memory map the resulting # train / test splits in readonly mode memory = Memory(os.path.join(get_data_home(), 'covertype_benchmark_data'), mmap_mode='r') @memory.cache def load_data(dtype=np.float32, order='C', random_state=13): """Load the data, then cache and memmap the train/test split""" ###################################################################### # Load dataset print("Loading dataset...") data = fetch_covtype(download_if_missing=True, shuffle=True, random_state=random_state) X = check_array(data['data'], dtype=dtype, order=order) y = (data['target'] != 1).astype(np.int) # Create train-test split (as [Joachims, 2006]) print("Creating train-test split...") n_train = 522911 X_train = X[:n_train] y_train = y[:n_train] X_test = X[n_train:] y_test = y[n_train:] # Standardize first 10 features (the numerical ones) mean = X_train.mean(axis=0) std = X_train.std(axis=0) mean[10:] = 0.0 std[10:] = 1.0 X_train = (X_train - mean) / std X_test = (X_test - mean) / std return X_train, X_test, y_train, y_test ESTIMATORS = { 'GBRT': GradientBoostingClassifier(n_estimators=250), 'ExtraTrees': ExtraTreesClassifier(n_estimators=20), 'RandomForest': RandomForestClassifier(n_estimators=20), 'CART': DecisionTreeClassifier(min_samples_split=5), 'SGD': SGDClassifier(alpha=0.001, max_iter=1000, tol=1e-3), 'GaussianNB': GaussianNB(), 'liblinear': LinearSVC(loss="l2", penalty="l2", C=1000, dual=False, tol=1e-3), 'SAG': LogisticRegression(solver='sag', max_iter=2, C=1000) } if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('--classifiers', nargs="+", choices=ESTIMATORS, type=str, default=['liblinear', 'GaussianNB', 'SGD', 'CART'], help="list of classifiers to benchmark.") parser.add_argument('--n-jobs', nargs="?", default=1, type=int, help="Number of concurrently running workers for " "models that support parallelism.") parser.add_argument('--order', nargs="?", default="C", type=str, choices=["F", "C"], help="Allow to choose between fortran and C ordered " "data") parser.add_argument('--random-seed', nargs="?", default=13, type=int, help="Common seed used by random number generator.") args = vars(parser.parse_args()) print(__doc__) X_train, X_test, y_train, y_test = load_data( order=args["order"], random_state=args["random_seed"]) print("") print("Dataset statistics:") print("===================") print("%s %d" % ("number of features:".ljust(25), X_train.shape[1])) print("%s %d" % ("number of classes:".ljust(25), np.unique(y_train).size)) print("%s %s" % ("data type:".ljust(25), X_train.dtype)) print("%s %d (pos=%d, neg=%d, size=%dMB)" % ("number of train samples:".ljust(25), X_train.shape[0], np.sum(y_train == 1), np.sum(y_train == 0), int(X_train.nbytes / 1e6))) print("%s %d (pos=%d, 
neg=%d, size=%dMB)" % ("number of test samples:".ljust(25), X_test.shape[0], np.sum(y_test == 1), np.sum(y_test == 0), int(X_test.nbytes / 1e6))) print() print("Training Classifiers") print("====================") error, train_time, test_time = {}, {}, {} for name in sorted(args["classifiers"]): print("Training %s ... " % name, end="") estimator = ESTIMATORS[name] estimator_params = estimator.get_params() estimator.set_params(**{p: args["random_seed"] for p in estimator_params if p.endswith("random_state")}) if "n_jobs" in estimator_params: estimator.set_params(n_jobs=args["n_jobs"]) time_start = time() estimator.fit(X_train, y_train) train_time[name] = time() - time_start time_start = time() y_pred = estimator.predict(X_test) test_time[name] = time() - time_start error[name] = zero_one_loss(y_test, y_pred) print("done") print() print("Classification performance:") print("===========================") print("%s %s %s %s" % ("Classifier ", "train-time", "test-time", "error-rate")) print("-" * 44) for name in sorted(args["classifiers"], key=error.get): print("%s %s %s %s" % (name.ljust(12), ("%.4fs" % train_time[name]).center(10), ("%.4fs" % test_time[name]).center(10), ("%.4f" % error[name]).center(10))) print() scikit-learn-0.19.1/benchmarks/bench_glm.py000066400000000000000000000027531317344356400206150ustar00rootroot00000000000000""" A comparison of different methods in GLM Data comes from a random square matrix. """ from datetime import datetime import numpy as np from sklearn import linear_model from sklearn.utils.bench import total_seconds if __name__ == '__main__': import matplotlib.pyplot as plt n_iter = 40 time_ridge = np.empty(n_iter) time_ols = np.empty(n_iter) time_lasso = np.empty(n_iter) dimensions = 500 * np.arange(1, n_iter + 1) for i in range(n_iter): print('Iteration %s of %s' % (i, n_iter)) n_samples, n_features = 10 * i + 3, 10 * i + 3 X = np.random.randn(n_samples, n_features) Y = np.random.randn(n_samples) start = datetime.now() ridge = linear_model.Ridge(alpha=1.) ridge.fit(X, Y) time_ridge[i] = total_seconds(datetime.now() - start) start = datetime.now() ols = linear_model.LinearRegression() ols.fit(X, Y) time_ols[i] = total_seconds(datetime.now() - start) start = datetime.now() lasso = linear_model.LassoLars() lasso.fit(X, Y) time_lasso[i] = total_seconds(datetime.now() - start) plt.figure('scikit-learn GLM benchmark results') plt.xlabel('Dimensions') plt.ylabel('Time (s)') plt.plot(dimensions, time_ridge, color='r') plt.plot(dimensions, time_ols, color='g') plt.plot(dimensions, time_lasso, color='b') plt.legend(['Ridge', 'OLS', 'LassoLars'], loc='upper left') plt.axis('tight') plt.show() scikit-learn-0.19.1/benchmarks/bench_glmnet.py000066400000000000000000000074621317344356400213260ustar00rootroot00000000000000""" To run this, you'll need to have installed. * glmnet-python * scikit-learn (of course) Does two benchmarks First, we fix a training set and increase the number of samples. Then we plot the computation time as function of the number of samples. In the second benchmark, we increase the number of dimensions of the training set. Then we plot the computation time as function of the number of dimensions. In both cases, only 10% of the features are informative. 
""" import numpy as np import gc from time import time from sklearn.datasets.samples_generator import make_regression alpha = 0.1 # alpha = 0.01 def rmse(a, b): return np.sqrt(np.mean((a - b) ** 2)) def bench(factory, X, Y, X_test, Y_test, ref_coef): gc.collect() # start time tstart = time() clf = factory(alpha=alpha).fit(X, Y) delta = (time() - tstart) # stop time print("duration: %0.3fs" % delta) print("rmse: %f" % rmse(Y_test, clf.predict(X_test))) print("mean coef abs diff: %f" % abs(ref_coef - clf.coef_.ravel()).mean()) return delta if __name__ == '__main__': from glmnet.elastic_net import Lasso as GlmnetLasso from sklearn.linear_model import Lasso as ScikitLasso # Delayed import of matplotlib.pyplot import matplotlib.pyplot as plt scikit_results = [] glmnet_results = [] n = 20 step = 500 n_features = 1000 n_informative = n_features / 10 n_test_samples = 1000 for i in range(1, n + 1): print('==================') print('Iteration %s of %s' % (i, n)) print('==================') X, Y, coef_ = make_regression( n_samples=(i * step) + n_test_samples, n_features=n_features, noise=0.1, n_informative=n_informative, coef=True) X_test = X[-n_test_samples:] Y_test = Y[-n_test_samples:] X = X[:(i * step)] Y = Y[:(i * step)] print("benchmarking scikit-learn: ") scikit_results.append(bench(ScikitLasso, X, Y, X_test, Y_test, coef_)) print("benchmarking glmnet: ") glmnet_results.append(bench(GlmnetLasso, X, Y, X_test, Y_test, coef_)) plt.clf() xx = range(0, n * step, step) plt.title('Lasso regression on sample dataset (%d features)' % n_features) plt.plot(xx, scikit_results, 'b-', label='scikit-learn') plt.plot(xx, glmnet_results, 'r-', label='glmnet') plt.legend() plt.xlabel('number of samples to classify') plt.ylabel('Time (s)') plt.show() # now do a benchmark where the number of points is fixed # and the variable is the number of features scikit_results = [] glmnet_results = [] n = 20 step = 100 n_samples = 500 for i in range(1, n + 1): print('==================') print('Iteration %02d of %02d' % (i, n)) print('==================') n_features = i * step n_informative = n_features / 10 X, Y, coef_ = make_regression( n_samples=(i * step) + n_test_samples, n_features=n_features, noise=0.1, n_informative=n_informative, coef=True) X_test = X[-n_test_samples:] Y_test = Y[-n_test_samples:] X = X[:n_samples] Y = Y[:n_samples] print("benchmarking scikit-learn: ") scikit_results.append(bench(ScikitLasso, X, Y, X_test, Y_test, coef_)) print("benchmarking glmnet: ") glmnet_results.append(bench(GlmnetLasso, X, Y, X_test, Y_test, coef_)) xx = np.arange(100, 100 + n * step, step) plt.figure('scikit-learn vs. glmnet benchmark results') plt.title('Regression in high dimensional spaces (%d samples)' % n_samples) plt.plot(xx, scikit_results, 'b-', label='scikit-learn') plt.plot(xx, glmnet_results, 'r-', label='glmnet') plt.legend() plt.xlabel('number of features') plt.ylabel('Time (s)') plt.axis('tight') plt.show() scikit-learn-0.19.1/benchmarks/bench_isolation_forest.py000066400000000000000000000112611317344356400234130ustar00rootroot00000000000000""" ========================================== IsolationForest benchmark ========================================== A test of IsolationForest on classical anomaly detection datasets. 
""" from time import time import numpy as np import matplotlib.pyplot as plt from sklearn.ensemble import IsolationForest from sklearn.metrics import roc_curve, auc from sklearn.datasets import fetch_kddcup99, fetch_covtype, fetch_mldata from sklearn.preprocessing import MultiLabelBinarizer from sklearn.utils import shuffle as sh print(__doc__) def print_outlier_ratio(y): """ Helper function to show the distinct value count of element in the target. Useful indicator for the datasets used in bench_isolation_forest.py. """ uniq, cnt = np.unique(y, return_counts=True) print("----- Target count values: ") for u, c in zip(uniq, cnt): print("------ %s -> %d occurrences" % (str(u), c)) print("----- Outlier ratio: %.5f" % (np.min(cnt) / len(y))) np.random.seed(1) fig_roc, ax_roc = plt.subplots(1, 1, figsize=(8, 5)) # Set this to true for plotting score histograms for each dataset: with_decision_function_histograms = False # Removed the shuttle dataset because as of 2017-03-23 mldata.org is down: # datasets = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover'] datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover'] # Loop over all datasets for fitting and scoring the estimator: for dat in datasets: # Loading and vectorizing the data: print('====== %s ======' % dat) print('--- Fetching data...') if dat in ['http', 'smtp', 'SF', 'SA']: dataset = fetch_kddcup99(subset=dat, shuffle=True, percent10=True) X = dataset.data y = dataset.target if dat == 'shuttle': dataset = fetch_mldata('shuttle') X = dataset.data y = dataset.target X, y = sh(X, y) # we remove data with label 4 # normal data are then those of class 1 s = (y != 4) X = X[s, :] y = y[s] y = (y != 1).astype(int) print('----- ') if dat == 'forestcover': dataset = fetch_covtype(shuffle=True) X = dataset.data y = dataset.target # normal data are those with attribute 2 # abnormal those with attribute 4 s = (y == 2) + (y == 4) X = X[s, :] y = y[s] y = (y != 2).astype(int) print_outlier_ratio(y) print('--- Vectorizing data...') if dat == 'SF': lb = MultiLabelBinarizer() x1 = lb.fit_transform(X[:, 1]) X = np.c_[X[:, :1], x1, X[:, 2:]] y = (y != b'normal.').astype(int) print_outlier_ratio(y) if dat == 'SA': lb = MultiLabelBinarizer() x1 = lb.fit_transform(X[:, 1]) x2 = lb.fit_transform(X[:, 2]) x3 = lb.fit_transform(X[:, 3]) X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]] y = (y != b'normal.').astype(int) print_outlier_ratio(y) if dat in ('http', 'smtp'): y = (y != b'normal.').astype(int) print_outlier_ratio(y) n_samples, n_features = X.shape n_samples_train = n_samples // 2 X = X.astype(float) X_train = X[:n_samples_train, :] X_test = X[n_samples_train:, :] y_train = y[:n_samples_train] y_test = y[n_samples_train:] print('--- Fitting the IsolationForest estimator...') model = IsolationForest(n_jobs=-1) tstart = time() model.fit(X_train) fit_time = time() - tstart tstart = time() scoring = - model.decision_function(X_test) # the lower, the more abnormal print("--- Preparing the plot elements...") if with_decision_function_histograms: fig, ax = plt.subplots(3, sharex=True, sharey=True) bins = np.linspace(-0.5, 0.5, 200) ax[0].hist(scoring, bins, color='black') ax[0].set_title('Decision function for %s dataset' % dat) ax[1].hist(scoring[y_test == 0], bins, color='b', label='normal data') ax[1].legend(loc="lower right") ax[2].hist(scoring[y_test == 1], bins, color='r', label='outliers') ax[2].legend(loc="lower right") # Show ROC Curves predict_time = time() - tstart fpr, tpr, thresholds = roc_curve(y_test, scoring) auc_score = auc(fpr, tpr) label = ('%s (AUC: 
%0.3f, train_time= %0.2fs, ' 'test_time= %0.2fs)' % (dat, auc_score, fit_time, predict_time)) # Print AUC score and train/test time: print(label) ax_roc.plot(fpr, tpr, lw=1, label=label) ax_roc.set_xlim([-0.05, 1.05]) ax_roc.set_ylim([-0.05, 1.05]) ax_roc.set_xlabel('False Positive Rate') ax_roc.set_ylabel('True Positive Rate') ax_roc.set_title('Receiver operating characteristic (ROC) curves') ax_roc.legend(loc="lower right") fig_roc.tight_layout() plt.show() scikit-learn-0.19.1/benchmarks/bench_isotonic.py000066400000000000000000000066021317344356400216620ustar00rootroot00000000000000""" Benchmarks of isotonic regression performance. We generate a synthetic dataset of size 10^n, for n in [min, max], and examine the time taken to run isotonic regression over the dataset. The timings are then output to stdout, or visualized on a log-log scale with matplotlib. This allows the scaling of the algorithm with the problem size to be visualized and understood. """ from __future__ import print_function import numpy as np import gc from datetime import datetime from sklearn.isotonic import isotonic_regression from sklearn.utils.bench import total_seconds import matplotlib.pyplot as plt import argparse def generate_perturbed_logarithm_dataset(size): return (np.random.randint(-50, 50, size=size) + 50. * np.log(1 + np.arange(size))) def generate_logistic_dataset(size): X = np.sort(np.random.normal(size=size)) return np.random.random(size=size) < 1.0 / (1.0 + np.exp(-X)) def generate_pathological_dataset(size): # Triggers O(n^2) complexity on the original implementation. return np.r_[np.arange(size), np.arange(-(size - 1), size), np.arange(-(size - 1), 1)] DATASET_GENERATORS = { 'perturbed_logarithm': generate_perturbed_logarithm_dataset, 'logistic': generate_logistic_dataset, 'pathological': generate_pathological_dataset, } def bench_isotonic_regression(Y): """ Runs a single iteration of isotonic regression on the input data, and reports the total time taken (in seconds). 
""" gc.collect() tstart = datetime.now() isotonic_regression(Y) delta = datetime.now() - tstart return total_seconds(delta) if __name__ == '__main__': parser = argparse.ArgumentParser( description="Isotonic Regression benchmark tool") parser.add_argument('--seed', type=int, help="RNG seed") parser.add_argument('--iterations', type=int, required=True, help="Number of iterations to average timings over " "for each problem size") parser.add_argument('--log_min_problem_size', type=int, required=True, help="Base 10 logarithm of the minimum problem size") parser.add_argument('--log_max_problem_size', type=int, required=True, help="Base 10 logarithm of the maximum problem size") parser.add_argument('--show_plot', action='store_true', help="Plot timing output with matplotlib") parser.add_argument('--dataset', choices=DATASET_GENERATORS.keys(), required=True) args = parser.parse_args() np.random.seed(args.seed) timings = [] for exponent in range(args.log_min_problem_size, args.log_max_problem_size): n = 10 ** exponent Y = DATASET_GENERATORS[args.dataset](n) time_per_iteration = \ [bench_isotonic_regression(Y) for i in range(args.iterations)] timing = (n, np.mean(time_per_iteration)) timings.append(timing) # If we're not plotting, dump the timing to stdout if not args.show_plot: print(n, np.mean(time_per_iteration)) if args.show_plot: plt.plot(*zip(*timings)) plt.title("Average time taken running isotonic regression") plt.xlabel('Number of observations') plt.ylabel('Time (s)') plt.axis('tight') plt.loglog() plt.show() scikit-learn-0.19.1/benchmarks/bench_lasso.py000066400000000000000000000064441317344356400211600ustar00rootroot00000000000000""" Benchmarks of Lasso vs LassoLars First, we fix a training set and increase the number of samples. Then we plot the computation time as function of the number of samples. In the second benchmark, we increase the number of dimensions of the training set. Then we plot the computation time as function of the number of dimensions. In both cases, only 10% of the features are informative. 
""" import gc from time import time import numpy as np from sklearn.datasets.samples_generator import make_regression def compute_bench(alpha, n_samples, n_features, precompute): lasso_results = [] lars_lasso_results = [] it = 0 for ns in n_samples: for nf in n_features: it += 1 print('==================') print('Iteration %s of %s' % (it, max(len(n_samples), len(n_features)))) print('==================') n_informative = nf // 10 X, Y, coef_ = make_regression(n_samples=ns, n_features=nf, n_informative=n_informative, noise=0.1, coef=True) X /= np.sqrt(np.sum(X ** 2, axis=0)) # Normalize data gc.collect() print("- benchmarking Lasso") clf = Lasso(alpha=alpha, fit_intercept=False, precompute=precompute) tstart = time() clf.fit(X, Y) lasso_results.append(time() - tstart) gc.collect() print("- benchmarking LassoLars") clf = LassoLars(alpha=alpha, fit_intercept=False, normalize=False, precompute=precompute) tstart = time() clf.fit(X, Y) lars_lasso_results.append(time() - tstart) return lasso_results, lars_lasso_results if __name__ == '__main__': from sklearn.linear_model import Lasso, LassoLars import matplotlib.pyplot as plt alpha = 0.01 # regularization parameter n_features = 10 list_n_samples = np.linspace(100, 1000000, 5).astype(np.int) lasso_results, lars_lasso_results = compute_bench(alpha, list_n_samples, [n_features], precompute=True) plt.figure('scikit-learn LASSO benchmark results') plt.subplot(211) plt.plot(list_n_samples, lasso_results, 'b-', label='Lasso') plt.plot(list_n_samples, lars_lasso_results, 'r-', label='LassoLars') plt.title('precomputed Gram matrix, %d features, alpha=%s' % (n_features, alpha)) plt.legend(loc='upper left') plt.xlabel('number of samples') plt.ylabel('Time (s)') plt.axis('tight') n_samples = 2000 list_n_features = np.linspace(500, 3000, 5).astype(np.int) lasso_results, lars_lasso_results = compute_bench(alpha, [n_samples], list_n_features, precompute=False) plt.subplot(212) plt.plot(list_n_features, lasso_results, 'b-', label='Lasso') plt.plot(list_n_features, lars_lasso_results, 'r-', label='LassoLars') plt.title('%d samples, alpha=%s' % (n_samples, alpha)) plt.legend(loc='upper left') plt.xlabel('number of features') plt.ylabel('Time (s)') plt.axis('tight') plt.show() scikit-learn-0.19.1/benchmarks/bench_lof.py000066400000000000000000000067341317344356400206210ustar00rootroot00000000000000""" ============================ LocalOutlierFactor benchmark ============================ A test of LocalOutlierFactor on classical anomaly detection datasets. 
""" from time import time import numpy as np import matplotlib.pyplot as plt from sklearn.neighbors import LocalOutlierFactor from sklearn.metrics import roc_curve, auc from sklearn.datasets import fetch_kddcup99, fetch_covtype, fetch_mldata from sklearn.preprocessing import LabelBinarizer from sklearn.utils import shuffle as sh print(__doc__) np.random.seed(2) # datasets available: ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover'] datasets = ['shuttle'] novelty_detection = True # if False, training set polluted by outliers for dataset_name in datasets: # loading and vectorization print('loading data') if dataset_name in ['http', 'smtp', 'SA', 'SF']: dataset = fetch_kddcup99(subset=dataset_name, shuffle=True, percent10=False) X = dataset.data y = dataset.target if dataset_name == 'shuttle': dataset = fetch_mldata('shuttle') X = dataset.data y = dataset.target X, y = sh(X, y) # we remove data with label 4 # normal data are then those of class 1 s = (y != 4) X = X[s, :] y = y[s] y = (y != 1).astype(int) if dataset_name == 'forestcover': dataset = fetch_covtype(shuffle=True) X = dataset.data y = dataset.target # normal data are those with attribute 2 # abnormal those with attribute 4 s = (y == 2) + (y == 4) X = X[s, :] y = y[s] y = (y != 2).astype(int) print('vectorizing data') if dataset_name == 'SF': lb = LabelBinarizer() lb.fit(X[:, 1]) x1 = lb.transform(X[:, 1]) X = np.c_[X[:, :1], x1, X[:, 2:]] y = (y != 'normal.').astype(int) if dataset_name == 'SA': lb = LabelBinarizer() lb.fit(X[:, 1]) x1 = lb.transform(X[:, 1]) lb.fit(X[:, 2]) x2 = lb.transform(X[:, 2]) lb.fit(X[:, 3]) x3 = lb.transform(X[:, 3]) X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]] y = (y != 'normal.').astype(int) if dataset_name == 'http' or dataset_name == 'smtp': y = (y != 'normal.').astype(int) n_samples, n_features = np.shape(X) n_samples_train = n_samples // 2 n_samples_test = n_samples - n_samples_train X = X.astype(float) X_train = X[:n_samples_train, :] X_test = X[n_samples_train:, :] y_train = y[:n_samples_train] y_test = y[n_samples_train:] if novelty_detection: X_train = X_train[y_train == 0] y_train = y_train[y_train == 0] print('LocalOutlierFactor processing...') model = LocalOutlierFactor(n_neighbors=20) tstart = time() model.fit(X_train) fit_time = time() - tstart tstart = time() scoring = -model.decision_function(X_test) # the lower, the more normal predict_time = time() - tstart fpr, tpr, thresholds = roc_curve(y_test, scoring) AUC = auc(fpr, tpr) plt.plot(fpr, tpr, lw=1, label=('ROC for %s (area = %0.3f, train-time: %0.2fs,' 'test-time: %0.2fs)' % (dataset_name, AUC, fit_time, predict_time))) plt.xlim([-0.05, 1.05]) plt.ylim([-0.05, 1.05]) plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.title('Receiver operating characteristic') plt.legend(loc="lower right") plt.show() scikit-learn-0.19.1/benchmarks/bench_mnist.py000066400000000000000000000155011317344356400211630ustar00rootroot00000000000000""" ======================= MNIST dataset benchmark ======================= Benchmark on the MNIST dataset. The dataset comprises 70,000 samples and 784 features. Here, we consider the task of predicting 10 classes - digits from 0 to 9 from their raw images. By contrast to the covertype dataset, the feature space is homogenous. Example of output : [..] 
Classification performance:
===========================
Classifier               train-time   test-time   error-rate
------------------------------------------------------------
MLP-adam                     53.46s       0.11s       0.0224
Nystroem-SVM                112.97s       0.92s       0.0228
MultilayerPerceptron         24.33s       0.14s       0.0287
ExtraTrees                   42.99s       0.57s       0.0294
RandomForest                 42.70s       0.49s       0.0318
SampledRBF-SVM              135.81s       0.56s       0.0486
LogisticRegression-SAG       16.67s       0.06s       0.0824
CART                         20.69s       0.02s       0.1219
dummy                         0.00s       0.01s       0.8973
"""
from __future__ import division, print_function

# Author: Issam H. Laradji
#         Arnaud Joly
# License: BSD 3 clause

import os
from time import time
import argparse
import numpy as np

from sklearn.datasets import fetch_mldata
from sklearn.datasets import get_data_home
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.externals.joblib import Memory
from sklearn.kernel_approximation import Nystroem
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics import zero_one_loss
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Memoize the data extraction and memory map the resulting
# train / test splits in readonly mode
memory = Memory(os.path.join(get_data_home(), 'mnist_benchmark_data'),
                mmap_mode='r')


@memory.cache
def load_data(dtype=np.float32, order='F'):
    """Load the data, then cache and memmap the train/test split"""
    ######################################################################
    # Load dataset
    print("Loading dataset...")
    data = fetch_mldata('MNIST original')
    X = check_array(data['data'], dtype=dtype, order=order)
    y = data["target"]

    # Normalize features
    X = X / 255

    # Create train-test split (as [Joachims, 2006])
    print("Creating train-test split...")
    n_train = 60000
    X_train = X[:n_train]
    y_train = y[:n_train]
    X_test = X[n_train:]
    y_test = y[n_train:]

    return X_train, X_test, y_train, y_test


ESTIMATORS = {
    "dummy": DummyClassifier(),
    'CART': DecisionTreeClassifier(),
    'ExtraTrees': ExtraTreesClassifier(n_estimators=100),
    'RandomForest': RandomForestClassifier(n_estimators=100),
    'Nystroem-SVM': make_pipeline(
        Nystroem(gamma=0.015, n_components=1000), LinearSVC(C=100)),
    'SampledRBF-SVM': make_pipeline(
        RBFSampler(gamma=0.015, n_components=1000), LinearSVC(C=100)),
    'LogisticRegression-SAG': LogisticRegression(solver='sag', tol=1e-1,
                                                 C=1e4),
    'LogisticRegression-SAGA': LogisticRegression(solver='saga', tol=1e-1,
                                                  C=1e4),
    'MultilayerPerceptron': MLPClassifier(
        hidden_layer_sizes=(100, 100), max_iter=400, alpha=1e-4,
        solver='sgd', learning_rate_init=0.2, momentum=0.9, verbose=1,
        tol=1e-4, random_state=1),
    'MLP-adam': MLPClassifier(
        hidden_layer_sizes=(100, 100), max_iter=400, alpha=1e-4,
        solver='adam', learning_rate_init=0.001, verbose=1,
        tol=1e-4, random_state=1)
}


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--classifiers', nargs="+",
                        choices=ESTIMATORS, type=str,
                        default=['ExtraTrees', 'Nystroem-SVM'],
                        help="list of classifiers to benchmark.")
    parser.add_argument('--n-jobs', nargs="?", default=1, type=int,
                        help="Number of concurrently running workers for "
                             "models that support parallelism.")
    parser.add_argument('--order', nargs="?", default="C", type=str,
                        choices=["F", "C"],
                        help="Choose between Fortran- and C-ordered "
                             "data")
    parser.add_argument('--random-seed',
nargs="?", default=0, type=int, help="Common seed used by random number generator.") args = vars(parser.parse_args()) print(__doc__) X_train, X_test, y_train, y_test = load_data(order=args["order"]) print("") print("Dataset statistics:") print("===================") print("%s %d" % ("number of features:".ljust(25), X_train.shape[1])) print("%s %d" % ("number of classes:".ljust(25), np.unique(y_train).size)) print("%s %s" % ("data type:".ljust(25), X_train.dtype)) print("%s %d (size=%dMB)" % ("number of train samples:".ljust(25), X_train.shape[0], int(X_train.nbytes / 1e6))) print("%s %d (size=%dMB)" % ("number of test samples:".ljust(25), X_test.shape[0], int(X_test.nbytes / 1e6))) print() print("Training Classifiers") print("====================") error, train_time, test_time = {}, {}, {} for name in sorted(args["classifiers"]): print("Training %s ... " % name, end="") estimator = ESTIMATORS[name] estimator_params = estimator.get_params() estimator.set_params(**{p: args["random_seed"] for p in estimator_params if p.endswith("random_state")}) if "n_jobs" in estimator_params: estimator.set_params(n_jobs=args["n_jobs"]) time_start = time() estimator.fit(X_train, y_train) train_time[name] = time() - time_start time_start = time() y_pred = estimator.predict(X_test) test_time[name] = time() - time_start error[name] = zero_one_loss(y_test, y_pred) print("done") print() print("Classification performance:") print("===========================") print("{0: <24} {1: >10} {2: >11} {3: >12}" "".format("Classifier ", "train-time", "test-time", "error-rate")) print("-" * 60) for name in sorted(args["classifiers"], key=error.get): print("{0: <23} {1: >10.2f}s {2: >10.2f}s {3: >12.4f}" "".format(name, train_time[name], test_time[name], error[name])) print() scikit-learn-0.19.1/benchmarks/bench_multilabel_metrics.py000077500000000000000000000157421317344356400237230ustar00rootroot00000000000000#!/usr/bin/env python """ A comparison of multilabel target formats and metrics over them """ from __future__ import division from __future__ import print_function from timeit import timeit from functools import partial import itertools import argparse import sys import matplotlib.pyplot as plt import scipy.sparse as sp import numpy as np from sklearn.datasets import make_multilabel_classification from sklearn.metrics import (f1_score, accuracy_score, hamming_loss, jaccard_similarity_score) from sklearn.utils.testing import ignore_warnings METRICS = { 'f1': partial(f1_score, average='micro'), 'f1-by-sample': partial(f1_score, average='samples'), 'accuracy': accuracy_score, 'hamming': hamming_loss, 'jaccard': jaccard_similarity_score, } FORMATS = { 'sequences': lambda y: [list(np.flatnonzero(s)) for s in y], 'dense': lambda y: y, 'csr': lambda y: sp.csr_matrix(y), 'csc': lambda y: sp.csc_matrix(y), } @ignore_warnings def benchmark(metrics=tuple(v for k, v in sorted(METRICS.items())), formats=tuple(v for k, v in sorted(FORMATS.items())), samples=1000, classes=4, density=.2, n_times=5): """Times metric calculations for a number of inputs Parameters ---------- metrics : array-like of callables (1d or 0d) The metric functions to time. formats : array-like of callables (1d or 0d) These may transform a dense indicator matrix into multilabel representation. samples : array-like of ints (1d or 0d) The number of samples to generate as input. classes : array-like of ints (1d or 0d) The number of classes in the input. density : array-like of ints (1d or 0d) The density of positive labels in the input. 
n_times : int Time calling the metric n_times times. Returns ------- array of floats shaped like (metrics, formats, samples, classes, density) Time in seconds. """ metrics = np.atleast_1d(metrics) samples = np.atleast_1d(samples) classes = np.atleast_1d(classes) density = np.atleast_1d(density) formats = np.atleast_1d(formats) out = np.zeros((len(metrics), len(formats), len(samples), len(classes), len(density)), dtype=float) it = itertools.product(samples, classes, density) for i, (s, c, d) in enumerate(it): _, y_true = make_multilabel_classification(n_samples=s, n_features=1, n_classes=c, n_labels=d * c, random_state=42) _, y_pred = make_multilabel_classification(n_samples=s, n_features=1, n_classes=c, n_labels=d * c, random_state=84) for j, f in enumerate(formats): f_true = f(y_true) f_pred = f(y_pred) for k, metric in enumerate(metrics): t = timeit(partial(metric, f_true, f_pred), number=n_times) out[k, j].flat[i] = t return out def _tabulate(results, metrics, formats): """Prints results by metric and format Uses the last ([-1]) value of other fields """ column_width = max(max(len(k) for k in formats) + 1, 8) first_width = max(len(k) for k in metrics) head_fmt = ('{:<{fw}s}' + '{:>{cw}s}' * len(formats)) row_fmt = ('{:<{fw}s}' + '{:>{cw}.3f}' * len(formats)) print(head_fmt.format('Metric', *formats, cw=column_width, fw=first_width)) for metric, row in zip(metrics, results[:, :, -1, -1, -1]): print(row_fmt.format(metric, *row, cw=column_width, fw=first_width)) def _plot(results, metrics, formats, title, x_ticks, x_label, format_markers=('x', '|', 'o', '+'), metric_colors=('c', 'm', 'y', 'k', 'g', 'r', 'b')): """ Plot the results by metric, format and some other variable given by x_label """ fig = plt.figure('scikit-learn multilabel metrics benchmarks') plt.title(title) ax = fig.add_subplot(111) for i, metric in enumerate(metrics): for j, format in enumerate(formats): ax.plot(x_ticks, results[i, j].flat, label='{}, {}'.format(metric, format), marker=format_markers[j], color=metric_colors[i % len(metric_colors)]) ax.set_xlabel(x_label) ax.set_ylabel('Time (s)') ax.legend() plt.show() if __name__ == "__main__": ap = argparse.ArgumentParser() ap.add_argument('metrics', nargs='*', default=sorted(METRICS), help='Specifies metrics to benchmark, defaults to all. 
' 'Choices are: {}'.format(sorted(METRICS))) ap.add_argument('--formats', nargs='+', choices=sorted(FORMATS), help='Specifies multilabel formats to benchmark ' '(defaults to all).') ap.add_argument('--samples', type=int, default=1000, help='The number of samples to generate') ap.add_argument('--classes', type=int, default=10, help='The number of classes') ap.add_argument('--density', type=float, default=.2, help='The average density of labels per sample') ap.add_argument('--plot', choices=['classes', 'density', 'samples'], default=None, help='Plot time with respect to this parameter varying ' 'up to the specified value') ap.add_argument('--n-steps', default=10, type=int, help='Plot this many points for each metric') ap.add_argument('--n-times', default=5, type=int, help="Time performance over n_times trials") args = ap.parse_args() if args.plot is not None: max_val = getattr(args, args.plot) if args.plot in ('classes', 'samples'): min_val = 2 else: min_val = 0 steps = np.linspace(min_val, max_val, num=args.n_steps + 1)[1:] if args.plot in ('classes', 'samples'): steps = np.unique(np.round(steps).astype(int)) setattr(args, args.plot, steps) if args.metrics is None: args.metrics = sorted(METRICS) if args.formats is None: args.formats = sorted(FORMATS) results = benchmark([METRICS[k] for k in args.metrics], [FORMATS[k] for k in args.formats], args.samples, args.classes, args.density, args.n_times) _tabulate(results, args.metrics, args.formats) if args.plot is not None: print('Displaying plot', file=sys.stderr) title = ('Multilabel metrics with %s' % ', '.join('{0}={1}'.format(field, getattr(args, field)) for field in ['samples', 'classes', 'density'] if args.plot != field)) _plot(results, args.metrics, args.formats, title, steps, args.plot) scikit-learn-0.19.1/benchmarks/bench_plot_fastkmeans.py000066400000000000000000000110731317344356400232230ustar00rootroot00000000000000from __future__ import print_function from collections import defaultdict from time import time import six import numpy as np from numpy import random as nr from sklearn.cluster.k_means_ import KMeans, MiniBatchKMeans def compute_bench(samples_range, features_range): it = 0 results = defaultdict(lambda: []) chunk = 100 max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('==============================') print('Iteration %03d of %03d' % (it, max_it)) print('==============================') print() data = nr.randint(-50, 51, (n_samples, n_features)) print('K-Means') tstart = time() kmeans = KMeans(init='k-means++', n_clusters=10).fit(data) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %0.5f" % kmeans.inertia_) print() results['kmeans_speed'].append(delta) results['kmeans_quality'].append(kmeans.inertia_) print('Fast K-Means') # let's prepare the data in small chunks mbkmeans = MiniBatchKMeans(init='k-means++', n_clusters=10, batch_size=chunk) tstart = time() mbkmeans.fit(data) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %f" % mbkmeans.inertia_) print() print() results['MiniBatchKMeans Speed'].append(delta) results['MiniBatchKMeans Quality'].append(mbkmeans.inertia_) return results def compute_bench_2(chunks): results = defaultdict(lambda: []) n_features = 50000 means = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0.5, 0.5], [0.75, -0.5], [-1, 0.75], [1, 0]]) X = np.empty((0, 2)) for i in range(8): X = np.r_[X, means[i] + 0.8 * np.random.randn(n_features, 2)] max_it = len(chunks) it = 0 for chunk 
in chunks: it += 1 print('==============================') print('Iteration %03d of %03d' % (it, max_it)) print('==============================') print() print('Fast K-Means') tstart = time() mbkmeans = MiniBatchKMeans(init='k-means++', n_clusters=8, batch_size=chunk) mbkmeans.fit(X) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %0.3fs" % mbkmeans.inertia_) print() results['MiniBatchKMeans Speed'].append(delta) results['MiniBatchKMeans Quality'].append(mbkmeans.inertia_) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(50, 150, 5).astype(np.int) features_range = np.linspace(150, 50000, 5).astype(np.int) chunks = np.linspace(500, 10000, 15).astype(np.int) results = compute_bench(samples_range, features_range) results_2 = compute_bench_2(chunks) max_time = max([max(i) for i in [t for (label, t) in six.iteritems(results) if "speed" in label]]) max_inertia = max([max(i) for i in [ t for (label, t) in six.iteritems(results) if "speed" not in label]]) fig = plt.figure('scikit-learn K-Means benchmark results') for c, (label, timings) in zip('brcy', sorted(six.iteritems(results))): if 'speed' in label: ax = fig.add_subplot(2, 2, 1, projection='3d') ax.set_zlim3d(0.0, max_time * 1.1) else: ax = fig.add_subplot(2, 2, 2, projection='3d') ax.set_zlim3d(0.0, max_inertia * 1.1) X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) ax.plot_surface(X, Y, Z.T, cstride=1, rstride=1, color=c, alpha=0.5) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') i = 0 for c, (label, timings) in zip('br', sorted(six.iteritems(results_2))): i += 1 ax = fig.add_subplot(2, 2, i + 2) y = np.asarray(timings) ax.plot(chunks, y, color=c, alpha=0.8) ax.set_xlabel('Chunks') ax.set_ylabel(label) plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_incremental_pca.py000066400000000000000000000144361317344356400242210ustar00rootroot00000000000000""" ======================== IncrementalPCA benchmark ======================== Benchmarks for IncrementalPCA """ import numpy as np import gc from time import time from collections import defaultdict import matplotlib.pyplot as plt from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import IncrementalPCA, RandomizedPCA, PCA def plot_results(X, y, label): plt.plot(X, y, label=label, marker='o') def benchmark(estimator, data): gc.collect() print("Benching %s" % estimator) t0 = time() estimator.fit(data) training_time = time() - t0 data_t = estimator.transform(data) data_r = estimator.inverse_transform(data_t) reconstruction_error = np.mean(np.abs(data - data_r)) return {'time': training_time, 'error': reconstruction_error} def plot_feature_times(all_times, batch_size, all_components, data): plt.figure() plot_results(all_components, all_times['pca'], label="PCA") plot_results(all_components, all_times['ipca'], label="IncrementalPCA, bsize=%i" % batch_size) plot_results(all_components, all_times['rpca'], label="RandomizedPCA") plt.legend(loc="upper left") plt.suptitle("Algorithm runtime vs. 
n_components\n \ LFW, size %i x %i" % data.shape) plt.xlabel("Number of components (out of max %i)" % data.shape[1]) plt.ylabel("Time (seconds)") def plot_feature_errors(all_errors, batch_size, all_components, data): plt.figure() plot_results(all_components, all_errors['pca'], label="PCA") plot_results(all_components, all_errors['ipca'], label="IncrementalPCA, bsize=%i" % batch_size) plot_results(all_components, all_errors['rpca'], label="RandomizedPCA") plt.legend(loc="lower left") plt.suptitle("Algorithm error vs. n_components\n" "LFW, size %i x %i" % data.shape) plt.xlabel("Number of components (out of max %i)" % data.shape[1]) plt.ylabel("Mean absolute error") def plot_batch_times(all_times, n_features, all_batch_sizes, data): plt.figure() plot_results(all_batch_sizes, all_times['pca'], label="PCA") plot_results(all_batch_sizes, all_times['rpca'], label="RandomizedPCA") plot_results(all_batch_sizes, all_times['ipca'], label="IncrementalPCA") plt.legend(loc="lower left") plt.suptitle("Algorithm runtime vs. batch_size for n_components %i\n \ LFW, size %i x %i" % ( n_features, data.shape[0], data.shape[1])) plt.xlabel("Batch size") plt.ylabel("Time (seconds)") def plot_batch_errors(all_errors, n_features, all_batch_sizes, data): plt.figure() plot_results(all_batch_sizes, all_errors['pca'], label="PCA") plot_results(all_batch_sizes, all_errors['ipca'], label="IncrementalPCA") plt.legend(loc="lower left") plt.suptitle("Algorithm error vs. batch_size for n_components %i\n \ LFW, size %i x %i" % ( n_features, data.shape[0], data.shape[1])) plt.xlabel("Batch size") plt.ylabel("Mean absolute error") def fixed_batch_size_comparison(data): all_features = [i.astype(int) for i in np.linspace(data.shape[1] // 10, data.shape[1], num=5)] batch_size = 1000 # Compare runtimes and error for fixed batch size all_times = defaultdict(list) all_errors = defaultdict(list) for n_components in all_features: pca = PCA(n_components=n_components) rpca = RandomizedPCA(n_components=n_components, random_state=1999) ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size) results_dict = {k: benchmark(est, data) for k, est in [('pca', pca), ('ipca', ipca), ('rpca', rpca)]} for k in sorted(results_dict.keys()): all_times[k].append(results_dict[k]['time']) all_errors[k].append(results_dict[k]['error']) plot_feature_times(all_times, batch_size, all_features, data) plot_feature_errors(all_errors, batch_size, all_features, data) def variable_batch_size_comparison(data): batch_sizes = [i.astype(int) for i in np.linspace(data.shape[0] // 10, data.shape[0], num=10)] for n_components in [i.astype(int) for i in np.linspace(data.shape[1] // 10, data.shape[1], num=4)]: all_times = defaultdict(list) all_errors = defaultdict(list) pca = PCA(n_components=n_components) rpca = RandomizedPCA(n_components=n_components, random_state=1999) results_dict = {k: benchmark(est, data) for k, est in [('pca', pca), ('rpca', rpca)]} # Create flat baselines to compare the variation over batch size all_times['pca'].extend([results_dict['pca']['time']] * len(batch_sizes)) all_errors['pca'].extend([results_dict['pca']['error']] * len(batch_sizes)) all_times['rpca'].extend([results_dict['rpca']['time']] * len(batch_sizes)) all_errors['rpca'].extend([results_dict['rpca']['error']] * len(batch_sizes)) for batch_size in batch_sizes: ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size) results_dict = {k: benchmark(est, data) for k, est in [('ipca', ipca)]} all_times['ipca'].append(results_dict['ipca']['time']) 
all_errors['ipca'].append(results_dict['ipca']['error']) plot_batch_times(all_times, n_components, batch_sizes, data) # RandomizedPCA error is always worse (approx 100x) than other PCA # tests plot_batch_errors(all_errors, n_components, batch_sizes, data) faces = fetch_lfw_people(resize=.2, min_faces_per_person=5) # limit dataset to 5000 people (don't care who they are!) X = faces.data[:5000] n_samples, h, w = faces.images.shape n_features = X.shape[1] X -= X.mean(axis=0) X /= X.std(axis=0) fixed_batch_size_comparison(X) variable_batch_size_comparison(X) plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_lasso_path.py000066400000000000000000000076451317344356400232360ustar00rootroot00000000000000"""Benchmarks of Lasso regularization path computation using Lars and CD The input data is mostly low rank but is a fat infinite tail. """ from __future__ import print_function from collections import defaultdict import gc import sys from time import time import numpy as np from sklearn.linear_model import lars_path from sklearn.linear_model import lasso_path from sklearn.datasets.samples_generator import make_regression def compute_bench(samples_range, features_range): it = 0 results = defaultdict(lambda: []) max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') dataset_kwargs = { 'n_samples': n_samples, 'n_features': n_features, 'n_informative': n_features / 10, 'effective_rank': min(n_samples, n_features) / 10, #'effective_rank': None, 'bias': 0.0, } print("n_samples: %d" % n_samples) print("n_features: %d" % n_features) X, y = make_regression(**dataset_kwargs) gc.collect() print("benchmarking lars_path (with Gram):", end='') sys.stdout.flush() tstart = time() G = np.dot(X.T, X) # precomputed Gram matrix Xy = np.dot(X.T, y) lars_path(X, y, Xy=Xy, Gram=G, method='lasso') delta = time() - tstart print("%0.3fs" % delta) results['lars_path (with Gram)'].append(delta) gc.collect() print("benchmarking lars_path (without Gram):", end='') sys.stdout.flush() tstart = time() lars_path(X, y, method='lasso') delta = time() - tstart print("%0.3fs" % delta) results['lars_path (without Gram)'].append(delta) gc.collect() print("benchmarking lasso_path (with Gram):", end='') sys.stdout.flush() tstart = time() lasso_path(X, y, precompute=True) delta = time() - tstart print("%0.3fs" % delta) results['lasso_path (with Gram)'].append(delta) gc.collect() print("benchmarking lasso_path (without Gram):", end='') sys.stdout.flush() tstart = time() lasso_path(X, y, precompute=False) delta = time() - tstart print("%0.3fs" % delta) results['lasso_path (without Gram)'].append(delta) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(10, 2000, 5).astype(np.int) features_range = np.linspace(10, 2000, 5).astype(np.int) results = compute_bench(samples_range, features_range) max_time = max(max(t) for t in results.values()) fig = plt.figure('scikit-learn Lasso path benchmark results') i = 1 for c, (label, timings) in zip('bcry', sorted(results.items())): ax = fig.add_subplot(2, 2, i, projection='3d') X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z.T, cstride=1, rstride=1, color=c, alpha=0.8) # dummy point 
plot to stick the legend to since surface plot do not # support legends (yet?) # ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') ax.set_zlabel('Time (s)') ax.set_zlim3d(0.0, max_time * 1.1) ax.set_title(label) # ax.legend() i += 1 plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_neighbors.py000066400000000000000000000145051317344356400230520ustar00rootroot00000000000000""" Plot the scaling of the nearest neighbors algorithms with k, D, and N """ from time import time import numpy as np import matplotlib.pyplot as plt from matplotlib import ticker from sklearn import neighbors, datasets def get_data(N, D, dataset='dense'): if dataset == 'dense': np.random.seed(0) return np.random.random((N, D)) elif dataset == 'digits': X = datasets.load_digits().data i = np.argsort(X[0])[::-1] X = X[:, i] return X[:N, :D] else: raise ValueError("invalid dataset: %s" % dataset) def barplot_neighbors(Nrange=2 ** np.arange(1, 11), Drange=2 ** np.arange(7), krange=2 ** np.arange(10), N=1000, D=64, k=5, leaf_size=30, dataset='digits'): algorithms = ('kd_tree', 'brute', 'ball_tree') fiducial_values = {'N': N, 'D': D, 'k': k} #------------------------------------------------------------ # varying N N_results_build = dict([(alg, np.zeros(len(Nrange))) for alg in algorithms]) N_results_query = dict([(alg, np.zeros(len(Nrange))) for alg in algorithms]) for i, NN in enumerate(Nrange): print("N = %i (%i out of %i)" % (NN, i + 1, len(Nrange))) X = get_data(NN, D, dataset) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=min(NN, k), algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() N_results_build[algorithm][i] = (t1 - t0) N_results_query[algorithm][i] = (t2 - t1) #------------------------------------------------------------ # varying D D_results_build = dict([(alg, np.zeros(len(Drange))) for alg in algorithms]) D_results_query = dict([(alg, np.zeros(len(Drange))) for alg in algorithms]) for i, DD in enumerate(Drange): print("D = %i (%i out of %i)" % (DD, i + 1, len(Drange))) X = get_data(N, DD, dataset) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=k, algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() D_results_build[algorithm][i] = (t1 - t0) D_results_query[algorithm][i] = (t2 - t1) #------------------------------------------------------------ # varying k k_results_build = dict([(alg, np.zeros(len(krange))) for alg in algorithms]) k_results_query = dict([(alg, np.zeros(len(krange))) for alg in algorithms]) X = get_data(N, DD, dataset) for i, kk in enumerate(krange): print("k = %i (%i out of %i)" % (kk, i + 1, len(krange))) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=kk, algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() k_results_build[algorithm][i] = (t1 - t0) k_results_query[algorithm][i] = (t2 - t1) plt.figure(figsize=(8, 11)) for (sbplt, vals, quantity, build_time, query_time) in [(311, Nrange, 'N', N_results_build, N_results_query), (312, Drange, 'D', D_results_build, D_results_query), (313, krange, 'k', k_results_build, k_results_query)]: ax = plt.subplot(sbplt, yscale='log') plt.grid(True) tick_vals = [] tick_labels = [] bottom = 10 ** np.min([min(np.floor(np.log10(build_time[alg]))) for alg in algorithms]) for i, alg in enumerate(algorithms): xvals = 0.1 + i * (1 + len(vals)) + np.arange(len(vals)) 
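            # Bar layout: algorithm i gets its own group of len(vals) unit
            # slots; the i * (1 + len(vals)) term leaves one empty slot
            # between groups, and the 0.1 pads the group off the left edge.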
width = 0.8 c_bar = plt.bar(xvals, build_time[alg] - bottom, width, bottom, color='r') q_bar = plt.bar(xvals, query_time[alg], width, build_time[alg], color='b') tick_vals += list(xvals + 0.5 * width) tick_labels += ['%i' % val for val in vals] plt.text((i + 0.02) / len(algorithms), 0.98, alg, transform=ax.transAxes, ha='left', va='top', bbox=dict(facecolor='w', edgecolor='w', alpha=0.5)) plt.ylabel('Time (s)') ax.xaxis.set_major_locator(ticker.FixedLocator(tick_vals)) ax.xaxis.set_major_formatter(ticker.FixedFormatter(tick_labels)) for label in ax.get_xticklabels(): label.set_rotation(-90) label.set_fontsize(10) title_string = 'Varying %s' % quantity descr_string = '' for s in 'NDk': if s == quantity: pass else: descr_string += '%s = %i, ' % (s, fiducial_values[s]) descr_string = descr_string[:-2] plt.text(1.01, 0.5, title_string, transform=ax.transAxes, rotation=-90, ha='left', va='center', fontsize=20) plt.text(0.99, 0.5, descr_string, transform=ax.transAxes, rotation=-90, ha='right', va='center') plt.gcf().suptitle("%s data set" % dataset.capitalize(), fontsize=16) plt.figlegend((c_bar, q_bar), ('construction', 'N-point query'), 'upper right') if __name__ == '__main__': barplot_neighbors(dataset='digits') barplot_neighbors(dataset='dense') plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_nmf.py000066400000000000000000000363341317344356400216560ustar00rootroot00000000000000""" Benchmarks of Non-Negative Matrix Factorization """ # Authors: Tom Dupre la Tour (benchmark) # Chih-Jen Linn (original projected gradient NMF implementation) # Anthony Di Franco (projected gradient, Python and NumPy port) # License: BSD 3 clause from __future__ import print_function from time import time import sys import warnings import numbers import numpy as np import matplotlib.pyplot as plt import pandas from sklearn.utils.testing import ignore_warnings from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.decomposition.nmf import NMF from sklearn.decomposition.nmf import _initialize_nmf from sklearn.decomposition.nmf import _beta_divergence from sklearn.decomposition.nmf import INTEGER_TYPES, _check_init from sklearn.externals.joblib import Memory from sklearn.exceptions import ConvergenceWarning from sklearn.utils.extmath import safe_sparse_dot, squared_norm from sklearn.utils import check_array from sklearn.utils.validation import check_is_fitted, check_non_negative mem = Memory(cachedir='.', verbose=0) ################### # Start of _PGNMF # ################### # This class implements a projected gradient solver for the NMF. # The projected gradient solver was removed from scikit-learn in version 0.19, # and a simplified copy is used here for comparison purpose only. # It is not tested, and it may change or disappear without notice. def _norm(x): """Dot product-based Euclidean norm implementation See: http://fseoane.net/blog/2011/computing-the-vector-norm/ """ return np.sqrt(squared_norm(x)) def _nls_subproblem(X, W, H, tol, max_iter, alpha=0., l1_ratio=0., sigma=0.01, beta=0.1): """Non-negative least square solver Solves a non-negative least squares subproblem using the projected gradient descent algorithm. Parameters ---------- X : array-like, shape (n_samples, n_features) Constant matrix. W : array-like, shape (n_samples, n_components) Constant matrix. H : array-like, shape (n_components, n_features) Initial guess for the solution. tol : float Tolerance of the stopping condition. max_iter : int Maximum number of iterations before timing out. alpha : double, default: 0. 
Constant that multiplies the regularization terms. Set it to zero to have no regularization. l1_ratio : double, default: 0. The regularization mixing parameter, with 0 <= l1_ratio <= 1. For l1_ratio = 0 the penalty is an L2 penalty. For l1_ratio = 1 it is an L1 penalty. For 0 < l1_ratio < 1, the penalty is a combination of L1 and L2. sigma : float Constant used in the sufficient decrease condition checked by the line search. Smaller values lead to a looser sufficient decrease condition, thus reducing the time taken by the line search, but potentially increasing the number of iterations of the projected gradient procedure. 0.01 is a commonly used value in the optimization literature. beta : float Factor by which the step size is decreased (resp. increased) until (resp. as long as) the sufficient decrease condition is satisfied. Larger values allow to find a better step size but lead to longer line search. 0.1 is a commonly used value in the optimization literature. Returns ------- H : array-like, shape (n_components, n_features) Solution to the non-negative least squares problem. grad : array-like, shape (n_components, n_features) The gradient. n_iter : int The number of iterations done by the algorithm. References ---------- C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19(2007), 2756-2779. http://www.csie.ntu.edu.tw/~cjlin/nmf/ """ WtX = safe_sparse_dot(W.T, X) WtW = np.dot(W.T, W) # values justified in the paper (alpha is renamed gamma) gamma = 1 for n_iter in range(1, max_iter + 1): grad = np.dot(WtW, H) - WtX if alpha > 0 and l1_ratio == 1.: grad += alpha elif alpha > 0: grad += alpha * (l1_ratio + (1 - l1_ratio) * H) # The following multiplication with a boolean array is more than twice # as fast as indexing into grad. if _norm(grad * np.logical_or(grad < 0, H > 0)) < tol: break Hp = H for inner_iter in range(20): # Gradient step. Hn = H - gamma * grad # Projection step. 
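            # Multiplying by the boolean mask (Hn > 0) zeroes the negative
            # entries, i.e. it projects the gradient step back onto the
            # feasible non-negative orthant.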
Hn *= Hn > 0 d = Hn - H gradd = np.dot(grad.ravel(), d.ravel()) dQd = np.dot(np.dot(WtW, d).ravel(), d.ravel()) suff_decr = (1 - sigma) * gradd + 0.5 * dQd < 0 if inner_iter == 0: decr_gamma = not suff_decr if decr_gamma: if suff_decr: H = Hn break else: gamma *= beta elif not suff_decr or (Hp == Hn).all(): H = Hp break else: gamma /= beta Hp = Hn if n_iter == max_iter: warnings.warn("Iteration limit reached in nls subproblem.", ConvergenceWarning) return H, grad, n_iter def _fit_projected_gradient(X, W, H, tol, max_iter, nls_max_iter, alpha, l1_ratio): gradW = (np.dot(W, np.dot(H, H.T)) - safe_sparse_dot(X, H.T, dense_output=True)) gradH = (np.dot(np.dot(W.T, W), H) - safe_sparse_dot(W.T, X, dense_output=True)) init_grad = squared_norm(gradW) + squared_norm(gradH.T) # max(0.001, tol) to force alternating minimizations of W and H tolW = max(0.001, tol) * np.sqrt(init_grad) tolH = tolW for n_iter in range(1, max_iter + 1): # stopping condition as discussed in paper proj_grad_W = squared_norm(gradW * np.logical_or(gradW < 0, W > 0)) proj_grad_H = squared_norm(gradH * np.logical_or(gradH < 0, H > 0)) if (proj_grad_W + proj_grad_H) / init_grad < tol ** 2: break # update W Wt, gradWt, iterW = _nls_subproblem(X.T, H.T, W.T, tolW, nls_max_iter, alpha=alpha, l1_ratio=l1_ratio) W, gradW = Wt.T, gradWt.T if iterW == 1: tolW = 0.1 * tolW # update H H, gradH, iterH = _nls_subproblem(X, W, H, tolH, nls_max_iter, alpha=alpha, l1_ratio=l1_ratio) if iterH == 1: tolH = 0.1 * tolH H[H == 0] = 0 # fix up negative zeros if n_iter == max_iter: Wt, _, _ = _nls_subproblem(X.T, H.T, W.T, tolW, nls_max_iter, alpha=alpha, l1_ratio=l1_ratio) W = Wt.T return W, H, n_iter class _PGNMF(NMF): """Non-Negative Matrix Factorization (NMF) with projected gradient solver. This class is private and for comparison purpose only. It may change or disappear without notice. 
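    A minimal usage sketch (same pattern as the benchmark driver below;
    `X` stands for any non-negative data matrix):

        nmf = _PGNMF(n_components=10, tol=1e-4, max_iter=200)
        W = nmf.fit_transform(X)
        H = nmf.components_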
""" def __init__(self, n_components=None, solver='pg', init=None, tol=1e-4, max_iter=200, random_state=None, alpha=0., l1_ratio=0., nls_max_iter=10): super(_PGNMF, self).__init__( n_components=n_components, init=init, solver=solver, tol=tol, max_iter=max_iter, random_state=random_state, alpha=alpha, l1_ratio=l1_ratio) self.nls_max_iter = nls_max_iter def fit(self, X, y=None, **params): self.fit_transform(X, **params) return self def transform(self, X): check_is_fitted(self, 'components_') H = self.components_ W, _, self.n_iter_ = self._fit_transform(X, H=H, update_H=False) return W def inverse_transform(self, W): check_is_fitted(self, 'components_') return np.dot(W, self.components_) def fit_transform(self, X, y=None, W=None, H=None): W, H, self.n_iter = self._fit_transform(X, W=W, H=H, update_H=True) self.components_ = H return W def _fit_transform(self, X, y=None, W=None, H=None, update_H=True): X = check_array(X, accept_sparse=('csr', 'csc')) check_non_negative(X, "NMF (input X)") n_samples, n_features = X.shape n_components = self.n_components if n_components is None: n_components = n_features if (not isinstance(n_components, INTEGER_TYPES) or n_components <= 0): raise ValueError("Number of components must be a positive integer;" " got (n_components=%r)" % n_components) if not isinstance(self.max_iter, INTEGER_TYPES) or self.max_iter < 0: raise ValueError("Maximum number of iterations must be a positive " "integer; got (max_iter=%r)" % self.max_iter) if not isinstance(self.tol, numbers.Number) or self.tol < 0: raise ValueError("Tolerance for stopping criteria must be " "positive; got (tol=%r)" % self.tol) # check W and H, or initialize them if self.init == 'custom' and update_H: _check_init(H, (n_components, n_features), "NMF (input H)") _check_init(W, (n_samples, n_components), "NMF (input W)") elif not update_H: _check_init(H, (n_components, n_features), "NMF (input H)") W = np.zeros((n_samples, n_components)) else: W, H = _initialize_nmf(X, n_components, init=self.init, random_state=self.random_state) if update_H: # fit_transform W, H, n_iter = _fit_projected_gradient( X, W, H, self.tol, self.max_iter, self.nls_max_iter, self.alpha, self.l1_ratio) else: # transform Wt, _, n_iter = _nls_subproblem(X.T, H.T, W.T, self.tol, self.nls_max_iter, alpha=self.alpha, l1_ratio=self.l1_ratio) W = Wt.T if n_iter == self.max_iter and self.tol > 0: warnings.warn("Maximum number of iteration %d reached. Increase it" " to improve convergence." % self.max_iter, ConvergenceWarning) return W, H, n_iter ################# # End of _PGNMF # ################# def plot_results(results_df, plot_name): if results_df is None: return None plt.figure(figsize=(16, 6)) colors = 'bgr' markers = 'ovs' ax = plt.subplot(1, 3, 1) for i, init in enumerate(np.unique(results_df['init'])): plt.subplot(1, 3, i + 1, sharex=ax, sharey=ax) for j, method in enumerate(np.unique(results_df['method'])): mask = np.logical_and(results_df['init'] == init, results_df['method'] == method) selected_items = results_df[mask] plt.plot(selected_items['time'], selected_items['loss'], color=colors[j % len(colors)], ls='-', marker=markers[j % len(markers)], label=method) plt.legend(loc=0, fontsize='x-small') plt.xlabel("Time (s)") plt.ylabel("loss") plt.title("%s" % init) plt.suptitle(plot_name, fontsize=16) @ignore_warnings(category=ConvergenceWarning) # use joblib to cache the results. 
# X_shape is specified in arguments for avoiding hashing X @mem.cache(ignore=['X', 'W0', 'H0']) def bench_one(name, X, W0, H0, X_shape, clf_type, clf_params, init, n_components, random_state): W = W0.copy() H = H0.copy() clf = clf_type(**clf_params) st = time() W = clf.fit_transform(X, W=W, H=H) end = time() H = clf.components_ this_loss = _beta_divergence(X, W, H, 2.0, True) duration = end - st return this_loss, duration def run_bench(X, clfs, plot_name, n_components, tol, alpha, l1_ratio): start = time() results = [] for name, clf_type, iter_range, clf_params in clfs: print("Training %s:" % name) for rs, init in enumerate(('nndsvd', 'nndsvdar', 'random')): print(" %s %s: " % (init, " " * (8 - len(init))), end="") W, H = _initialize_nmf(X, n_components, init, 1e-6, rs) for max_iter in iter_range: clf_params['alpha'] = alpha clf_params['l1_ratio'] = l1_ratio clf_params['max_iter'] = max_iter clf_params['tol'] = tol clf_params['random_state'] = rs clf_params['init'] = 'custom' clf_params['n_components'] = n_components this_loss, duration = bench_one(name, X, W, H, X.shape, clf_type, clf_params, init, n_components, rs) init_name = "init='%s'" % init results.append((name, this_loss, duration, init_name)) # print("loss: %.6f, time: %.3f sec" % (this_loss, duration)) print(".", end="") sys.stdout.flush() print(" ") # Use a panda dataframe to organize the results results_df = pandas.DataFrame(results, columns="method loss time init".split()) print("Total time = %0.3f sec\n" % (time() - start)) # plot the results plot_results(results_df, plot_name) return results_df def load_20news(): print("Loading 20 newsgroups dataset") print("-----------------------------") from sklearn.datasets import fetch_20newsgroups dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes')) vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english') tfidf = vectorizer.fit_transform(dataset.data) return tfidf def load_faces(): print("Loading Olivetti face dataset") print("-----------------------------") from sklearn.datasets import fetch_olivetti_faces faces = fetch_olivetti_faces(shuffle=True) return faces.data def build_clfs(cd_iters, pg_iters, mu_iters): clfs = [("Coordinate Descent", NMF, cd_iters, {'solver': 'cd'}), ("Projected Gradient", _PGNMF, pg_iters, {'solver': 'pg'}), ("Multiplicative Update", NMF, mu_iters, {'solver': 'mu'}), ] return clfs if __name__ == '__main__': alpha = 0. l1_ratio = 0.5 n_components = 10 tol = 1e-15 # first benchmark on 20 newsgroup dataset: sparse, shape(11314, 39116) plot_name = "20 Newsgroups sparse dataset" cd_iters = np.arange(1, 30) pg_iters = np.arange(1, 6) mu_iters = np.arange(1, 30) clfs = build_clfs(cd_iters, pg_iters, mu_iters) X_20news = load_20news() run_bench(X_20news, clfs, plot_name, n_components, tol, alpha, l1_ratio) # second benchmark on Olivetti faces dataset: dense, shape(400, 4096) plot_name = "Olivetti Faces dense dataset" cd_iters = np.arange(1, 30) pg_iters = np.arange(1, 12) mu_iters = np.arange(1, 30) clfs = build_clfs(cd_iters, pg_iters, mu_iters) X_faces = load_faces() run_bench(X_faces, clfs, plot_name, n_components, tol, alpha, l1_ratio,) plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_omp_lars.py000066400000000000000000000106421317344356400227040ustar00rootroot00000000000000"""Benchmarks of orthogonal matching pursuit (:ref:`OMP`) versus least angle regression (:ref:`least_angle_regression`) The input data is mostly low rank but is a fat infinite tail. 
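For reference, the two solvers being compared reduce to these calls
(a sketch; `k` plays the role of n_informative in the code below):

    from sklearn.linear_model import lars_path, orthogonal_mp

    alphas, active, coefs = lars_path(X, y, max_iter=k)
    coef = orthogonal_mp(X, y, n_nonzero_coefs=k)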
""" from __future__ import print_function import gc import sys from time import time import six import numpy as np from sklearn.linear_model import lars_path, orthogonal_mp from sklearn.datasets.samples_generator import make_sparse_coded_signal def compute_bench(samples_range, features_range): it = 0 results = dict() lars = np.empty((len(features_range), len(samples_range))) lars_gram = lars.copy() omp = lars.copy() omp_gram = lars.copy() max_it = len(samples_range) * len(features_range) for i_s, n_samples in enumerate(samples_range): for i_f, n_features in enumerate(features_range): it += 1 n_informative = n_features / 10 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') # dataset_kwargs = { # 'n_train_samples': n_samples, # 'n_test_samples': 2, # 'n_features': n_features, # 'n_informative': n_informative, # 'effective_rank': min(n_samples, n_features) / 10, # #'effective_rank': None, # 'bias': 0.0, # } dataset_kwargs = { 'n_samples': 1, 'n_components': n_features, 'n_features': n_samples, 'n_nonzero_coefs': n_informative, 'random_state': 0 } print("n_samples: %d" % n_samples) print("n_features: %d" % n_features) y, X, _ = make_sparse_coded_signal(**dataset_kwargs) X = np.asfortranarray(X) gc.collect() print("benchmarking lars_path (with Gram):", end='') sys.stdout.flush() tstart = time() G = np.dot(X.T, X) # precomputed Gram matrix Xy = np.dot(X.T, y) lars_path(X, y, Xy=Xy, Gram=G, max_iter=n_informative) delta = time() - tstart print("%0.3fs" % delta) lars_gram[i_f, i_s] = delta gc.collect() print("benchmarking lars_path (without Gram):", end='') sys.stdout.flush() tstart = time() lars_path(X, y, Gram=None, max_iter=n_informative) delta = time() - tstart print("%0.3fs" % delta) lars[i_f, i_s] = delta gc.collect() print("benchmarking orthogonal_mp (with Gram):", end='') sys.stdout.flush() tstart = time() orthogonal_mp(X, y, precompute=True, n_nonzero_coefs=n_informative) delta = time() - tstart print("%0.3fs" % delta) omp_gram[i_f, i_s] = delta gc.collect() print("benchmarking orthogonal_mp (without Gram):", end='') sys.stdout.flush() tstart = time() orthogonal_mp(X, y, precompute=False, n_nonzero_coefs=n_informative) delta = time() - tstart print("%0.3fs" % delta) omp[i_f, i_s] = delta results['time(LARS) / time(OMP)\n (w/ Gram)'] = (lars_gram / omp_gram) results['time(LARS) / time(OMP)\n (w/o Gram)'] = (lars / omp) return results if __name__ == '__main__': samples_range = np.linspace(1000, 5000, 5).astype(np.int) features_range = np.linspace(1000, 5000, 5).astype(np.int) results = compute_bench(samples_range, features_range) max_time = max(np.max(t) for t in results.values()) import matplotlib.pyplot as plt fig = plt.figure('scikit-learn OMP vs. 
LARS benchmark results') for i, (label, timings) in enumerate(sorted(six.iteritems(results))): ax = fig.add_subplot(1, 2, i+1) vmax = max(1 - timings.min(), -1 + timings.max()) plt.matshow(timings, fignum=False, vmin=1 - vmax, vmax=1 + vmax) ax.set_xticklabels([''] + [str(each) for each in samples_range]) ax.set_yticklabels([''] + [str(each) for each in features_range]) plt.xlabel('n_samples') plt.ylabel('n_features') plt.title(label) plt.subplots_adjust(0.1, 0.08, 0.96, 0.98, 0.4, 0.63) ax = plt.axes([0.1, 0.08, 0.8, 0.06]) plt.colorbar(cax=ax, orientation='horizontal') plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_parallel_pairwise.py000066400000000000000000000023661317344356400245730ustar00rootroot00000000000000# Author: Mathieu Blondel # License: BSD 3 clause import time import matplotlib.pyplot as plt from sklearn.utils import check_random_state from sklearn.metrics.pairwise import pairwise_distances from sklearn.metrics.pairwise import pairwise_kernels def plot(func): random_state = check_random_state(0) one_core = [] multi_core = [] sample_sizes = range(1000, 6000, 1000) for n_samples in sample_sizes: X = random_state.rand(n_samples, 300) start = time.time() func(X, n_jobs=1) one_core.append(time.time() - start) start = time.time() func(X, n_jobs=-1) multi_core.append(time.time() - start) plt.figure('scikit-learn parallel %s benchmark results' % func.__name__) plt.plot(sample_sizes, one_core, label="one core") plt.plot(sample_sizes, multi_core, label="multi core") plt.xlabel('n_samples') plt.ylabel('Time (s)') plt.title('Parallel %s' % func.__name__) plt.legend() def euclidean_distances(X, n_jobs): return pairwise_distances(X, metric="euclidean", n_jobs=n_jobs) def rbf_kernels(X, n_jobs): return pairwise_kernels(X, metric="rbf", n_jobs=n_jobs, gamma=0.1) plot(euclidean_distances) plot(rbf_kernels) plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_randomized_svd.py000066400000000000000000000422151317344356400241010ustar00rootroot00000000000000""" Benchmarks on the power iterations phase in randomized SVD. We test on various synthetic and real datasets the effect of increasing the number of power iterations in terms of quality of approximation and running time. A number greater than 0 should help with noisy matrices, which are characterized by a slow spectral decay. We test several policy for normalizing the power iterations. Normalization is crucial to avoid numerical issues. The quality of the approximation is measured by the spectral norm discrepancy between the original input matrix and the reconstructed one (by multiplying the randomized_svd's outputs). The spectral norm is always equivalent to the largest singular value of a matrix. (3) justifies this choice. However, one can notice in these experiments that Frobenius and spectral norms behave very similarly in a qualitative sense. Therefore, we suggest to run these benchmarks with `enable_spectral_norm = False`, as Frobenius' is MUCH faster to compute. The benchmarks follow. 
(a) plot: time vs norm, varying number of power iterations data: many datasets goal: compare normalization policies and study how the number of power iterations affect time and norm (b) plot: n_iter vs norm, varying rank of data and number of components for randomized_SVD data: low-rank matrices on which we control the rank goal: study whether the rank of the matrix and the number of components extracted by randomized SVD affect "the optimal" number of power iterations (c) plot: time vs norm, varing datasets data: many datasets goal: compare default configurations We compare the following algorithms: - randomized_svd(..., power_iteration_normalizer='none') - randomized_svd(..., power_iteration_normalizer='LU') - randomized_svd(..., power_iteration_normalizer='QR') - randomized_svd(..., power_iteration_normalizer='auto') - fbpca.pca() from https://github.com/facebook/fbpca (if installed) Conclusion ---------- - n_iter=2 appears to be a good default value - power_iteration_normalizer='none' is OK if n_iter is small, otherwise LU gives similar errors to QR but is cheaper. That's what 'auto' implements. References ---------- (1) Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions Halko, et al., 2009 http://arxiv.org/abs/arXiv:0909.4061 (2) A randomized algorithm for the decomposition of matrices Per-Gunnar Martinsson, Vladimir Rokhlin and Mark Tygert (3) An implementation of a randomized algorithm for principal component analysis A. Szlam et al. 2014 """ # Author: Giorgio Patrini import numpy as np import scipy as sp import matplotlib.pyplot as plt import gc import pickle from time import time from collections import defaultdict import os.path from sklearn.utils import gen_batches from sklearn.utils.validation import check_random_state from sklearn.utils.extmath import randomized_svd from sklearn.datasets.samples_generator import (make_low_rank_matrix, make_sparse_uncorrelated) from sklearn.datasets import (fetch_lfw_people, fetch_mldata, fetch_20newsgroups_vectorized, fetch_olivetti_faces, fetch_rcv1) try: import fbpca fbpca_available = True except ImportError: fbpca_available = False # If this is enabled, tests are much slower and will crash with the large data enable_spectral_norm = False # TODO: compute approximate spectral norms with the power method as in # Estimating the largest eigenvalues by the power and Lanczos methods with # a random start, Jacek Kuczynski and Henryk Wozniakowski, SIAM Journal on # Matrix Analysis and Applications, 13 (4): 1094-1122, 1992. # This approximation is a very fast estimate of the spectral norm, but depends # on starting random vectors. # Determine when to switch to batch computation for matrix norms, # in case the reconstructed (dense) matrix is too large MAX_MEMORY = np.int(2e9) # The following datasets can be dowloaded manually from: # CIFAR 10: http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz # SVHN: http://ufldl.stanford.edu/housenumbers/train_32x32.mat CIFAR_FOLDER = "./cifar-10-batches-py/" SVHN_FOLDER = "./SVHN/" datasets = ['low rank matrix', 'lfw_people', 'olivetti_faces', '20newsgroups', 'MNIST original', 'CIFAR', 'a1a', 'SVHN', 'uncorrelated matrix'] big_sparse_datasets = ['big sparse matrix', 'rcv1'] def unpickle(file_name): with open(file_name, 'rb') as fo: return pickle.load(fo, encoding='latin1')["data"] def handle_missing_dataset(file_folder): if not os.path.isdir(file_folder): print("%s file folder not found. Test skipped." 
% file_folder) return 0 def get_data(dataset_name): print("Getting dataset: %s" % dataset_name) if dataset_name == 'lfw_people': X = fetch_lfw_people().data elif dataset_name == '20newsgroups': X = fetch_20newsgroups_vectorized().data[:, :100000] elif dataset_name == 'olivetti_faces': X = fetch_olivetti_faces().data elif dataset_name == 'rcv1': X = fetch_rcv1().data elif dataset_name == 'CIFAR': if handle_missing_dataset(CIFAR_FOLDER) == "skip": return X1 = [unpickle("%sdata_batch_%d" % (CIFAR_FOLDER, i + 1)) for i in range(5)] X = np.vstack(X1) del X1 elif dataset_name == 'SVHN': if handle_missing_dataset(SVHN_FOLDER) == 0: return X1 = sp.io.loadmat("%strain_32x32.mat" % SVHN_FOLDER)['X'] X2 = [X1[:, :, :, i].reshape(32 * 32 * 3) for i in range(X1.shape[3])] X = np.vstack(X2) del X1 del X2 elif dataset_name == 'low rank matrix': X = make_low_rank_matrix(n_samples=500, n_features=np.int(1e4), effective_rank=100, tail_strength=.5, random_state=random_state) elif dataset_name == 'uncorrelated matrix': X, _ = make_sparse_uncorrelated(n_samples=500, n_features=10000, random_state=random_state) elif dataset_name == 'big sparse matrix': sparsity = np.int(1e6) size = np.int(1e6) small_size = np.int(1e4) data = np.random.normal(0, 1, np.int(sparsity/10)) data = np.repeat(data, 10) row = np.random.uniform(0, small_size, sparsity) col = np.random.uniform(0, small_size, sparsity) X = sp.sparse.csr_matrix((data, (row, col)), shape=(size, small_size)) del data del row del col else: X = fetch_mldata(dataset_name).data return X def plot_time_vs_s(time, norm, point_labels, title): plt.figure() colors = ['g', 'b', 'y'] for i, l in enumerate(sorted(norm.keys())): if l != "fbpca": plt.plot(time[l], norm[l], label=l, marker='o', c=colors.pop()) else: plt.plot(time[l], norm[l], label=l, marker='^', c='red') for label, x, y in zip(point_labels, list(time[l]), list(norm[l])): plt.annotate(label, xy=(x, y), xytext=(0, -20), textcoords='offset points', ha='right', va='bottom') plt.legend(loc="upper right") plt.suptitle(title) plt.ylabel("norm discrepancy") plt.xlabel("running time [s]") def scatter_time_vs_s(time, norm, point_labels, title): plt.figure() size = 100 for i, l in enumerate(sorted(norm.keys())): if l != "fbpca": plt.scatter(time[l], norm[l], label=l, marker='o', c='b', s=size) for label, x, y in zip(point_labels, list(time[l]), list(norm[l])): plt.annotate(label, xy=(x, y), xytext=(0, -80), textcoords='offset points', ha='right', arrowprops=dict(arrowstyle="->", connectionstyle="arc3"), va='bottom', size=11, rotation=90) else: plt.scatter(time[l], norm[l], label=l, marker='^', c='red', s=size) for label, x, y in zip(point_labels, list(time[l]), list(norm[l])): plt.annotate(label, xy=(x, y), xytext=(0, 30), textcoords='offset points', ha='right', arrowprops=dict(arrowstyle="->", connectionstyle="arc3"), va='bottom', size=11, rotation=90) plt.legend(loc="best") plt.suptitle(title) plt.ylabel("norm discrepancy") plt.xlabel("running time [s]") def plot_power_iter_vs_s(power_iter, s, title): plt.figure() for l in sorted(s.keys()): plt.plot(power_iter, s[l], label=l, marker='o') plt.legend(loc="lower right", prop={'size': 10}) plt.suptitle(title) plt.ylabel("norm discrepancy") plt.xlabel("n_iter") def svd_timing(X, n_comps, n_iter, n_oversamples, power_iteration_normalizer='auto', method=None): """ Measure time for decomposition """ print("... 
def svd_timing(X, n_comps, n_iter, n_oversamples, power_iteration_normalizer='auto', method=None): """ Measure time for decomposition """ print("... running SVD ...") if method != 'fbpca': gc.collect() t0 = time() U, mu, V = randomized_svd(X, n_comps, n_oversamples, n_iter, power_iteration_normalizer, random_state=random_state, transpose=False) call_time = time() - t0 else: gc.collect() t0 = time() # There is a different convention for l here U, mu, V = fbpca.pca(X, n_comps, raw=True, n_iter=n_iter, l=n_oversamples+n_comps) call_time = time() - t0 return U, mu, V, call_time def norm_diff(A, norm=2, msg=True): """ Compute the norm of A (callers pass the difference with the original matrix). norm: 2 => spectral; 'fro' => Frobenius """ if msg: print("... computing %s norm ..." % norm) if norm == 2: # s = sp.linalg.norm(A, ord=2) # slow value = sp.sparse.linalg.svds(A, k=1, return_singular_vectors=False) else: if sp.sparse.issparse(A): value = sp.sparse.linalg.norm(A, ord=norm) else: value = sp.linalg.norm(A, ord=norm) return value def scalable_frobenius_norm_discrepancy(X, U, s, V): # if the input is not too big, just call scipy if X.shape[0] * X.shape[1] < MAX_MEMORY: A = X - U.dot(np.diag(s).dot(V)) return norm_diff(A, norm='fro') print("... computing fro norm by batches...") batch_size = 1000 Vhat = np.diag(s).dot(V) cum_norm = .0 for batch in gen_batches(X.shape[0], batch_size): M = X[batch, :] - U[batch, :].dot(Vhat) # accumulate squared batch norms: ||A||_F^2 is the sum of the # squared Frobenius norms of its row blocks cum_norm += norm_diff(M, norm='fro', msg=False) ** 2 return np.sqrt(cum_norm) def bench_a(X, dataset_name, power_iter, n_oversamples, n_comps): all_time = defaultdict(list) if enable_spectral_norm: all_spectral = defaultdict(list) X_spectral_norm = norm_diff(X, norm=2, msg=False) all_frobenius = defaultdict(list) X_fro_norm = norm_diff(X, norm='fro', msg=False) for pi in power_iter: for pm in ['none', 'LU', 'QR']: print("n_iter = %d on sklearn - %s" % (pi, pm)) U, s, V, time = svd_timing(X, n_comps, n_iter=pi, power_iteration_normalizer=pm, n_oversamples=n_oversamples) label = "sklearn - %s" % pm all_time[label].append(time) if enable_spectral_norm: A = U.dot(np.diag(s).dot(V)) all_spectral[label].append(norm_diff(X - A, norm=2) / X_spectral_norm) f = scalable_frobenius_norm_discrepancy(X, U, s, V) all_frobenius[label].append(f / X_fro_norm) if fbpca_available: print("n_iter = %d on fbpca" % (pi)) U, s, V, time = svd_timing(X, n_comps, n_iter=pi, power_iteration_normalizer=pm, n_oversamples=n_oversamples, method='fbpca') label = "fbpca" all_time[label].append(time) if enable_spectral_norm: A = U.dot(np.diag(s).dot(V)) all_spectral[label].append(norm_diff(X - A, norm=2) / X_spectral_norm) f = scalable_frobenius_norm_discrepancy(X, U, s, V) all_frobenius[label].append(f / X_fro_norm) if enable_spectral_norm: title = "%s: spectral norm diff vs running time" % (dataset_name) plot_time_vs_s(all_time, all_spectral, power_iter, title) title = "%s: Frobenius norm diff vs running time" % (dataset_name) plot_time_vs_s(all_time, all_frobenius, power_iter, title) def bench_b(power_list): n_samples, n_features = 1000, 10000 data_params = {'n_samples': n_samples, 'n_features': n_features, 'tail_strength': .7, 'random_state': random_state} dataset_name = "low rank matrix %d x %d" % (n_samples, n_features) ranks = [10, 50, 100] if enable_spectral_norm: all_spectral = defaultdict(list) all_frobenius = defaultdict(list) for rank in ranks: X = make_low_rank_matrix(effective_rank=rank, **data_params) if enable_spectral_norm: X_spectral_norm = norm_diff(X, norm=2, msg=False) X_fro_norm = norm_diff(X, norm='fro', msg=False) for n_comp in [np.int(rank/2), rank, rank*2]: label = "rank=%d, n_comp=%d" % (rank, n_comp) print(label)
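# Note: n_oversamples is kept deliberately small (2) below, presumably so
# that reconstruction quality is driven almost entirely by the number of
# power iterations being studied; this reading of the intent is an assumption.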
for pi in power_list: U, s, V, _ = svd_timing(X, n_comp, n_iter=pi, n_oversamples=2, power_iteration_normalizer='LU') if enable_spectral_norm: A = U.dot(np.diag(s).dot(V)) all_spectral[label].append(norm_diff(X - A, norm=2) / X_spectral_norm) f = scalable_frobenius_norm_discrepancy(X, U, s, V) all_frobenius[label].append(f / X_fro_norm) if enable_spectral_norm: title = "%s: spectral norm diff vs n power iteration" % (dataset_name) plot_power_iter_vs_s(power_list, all_spectral, title) title = "%s: Frobenius norm diff vs n power iteration" % (dataset_name) plot_power_iter_vs_s(power_list, all_frobenius, title) def bench_c(datasets, n_comps): all_time = defaultdict(list) if enable_spectral_norm: all_spectral = defaultdict(list) all_frobenius = defaultdict(list) for dataset_name in datasets: X = get_data(dataset_name) if X is None: continue if enable_spectral_norm: X_spectral_norm = norm_diff(X, norm=2, msg=False) X_fro_norm = norm_diff(X, norm='fro', msg=False) n_comps = np.minimum(n_comps, np.min(X.shape)) label = "sklearn" print("%s %d x %d - %s" % (dataset_name, X.shape[0], X.shape[1], label)) U, s, V, time = svd_timing(X, n_comps, n_iter=2, n_oversamples=10, method=label) all_time[label].append(time) if enable_spectral_norm: A = U.dot(np.diag(s).dot(V)) all_spectral[label].append(norm_diff(X - A, norm=2) / X_spectral_norm) f = scalable_frobenius_norm_discrepancy(X, U, s, V) all_frobenius[label].append(f / X_fro_norm) if fbpca_available: label = "fbpca" print("%s %d x %d - %s" % (dataset_name, X.shape[0], X.shape[1], label)) U, s, V, time = svd_timing(X, n_comps, n_iter=2, n_oversamples=2, method=label) all_time[label].append(time) if enable_spectral_norm: A = U.dot(np.diag(s).dot(V)) all_spectral[label].append(norm_diff(X - A, norm=2) / X_spectral_norm) f = scalable_frobenius_norm_discrepancy(X, U, s, V) all_frobenius[label].append(f / X_fro_norm) if len(all_time) == 0: raise ValueError("No tests ran. Aborting.") if enable_spectral_norm: title = "normalized spectral norm diff vs running time" scatter_time_vs_s(all_time, all_spectral, datasets, title) title = "normalized Frobenius norm diff vs running time" scatter_time_vs_s(all_time, all_frobenius, datasets, title) if __name__ == '__main__': random_state = check_random_state(1234) power_iter = np.linspace(0, 6, 7, dtype=int) n_comps = 50 for dataset_name in datasets: X = get_data(dataset_name) if X is None: continue print(" >>>>>> Benching sklearn and fbpca on %s %d x %d" % (dataset_name, X.shape[0], X.shape[1])) bench_a(X, dataset_name, power_iter, n_oversamples=2, n_comps=np.minimum(n_comps, np.min(X.shape))) print(" >>>>>> Benching on simulated low rank matrix with variable rank") bench_b(power_iter) print(" >>>>>> Benching sklearn and fbpca default configurations") bench_c(datasets + big_sparse_datasets, n_comps) plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_svd.py000066400000000000000000000055421317344356400216670ustar00rootroot00000000000000"""Benchmarks of Singular Value Decomposition (Exact and Approximate) The data is mostly low rank but has a fat, infinite tail.
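The benchmark below times SciPy's exact SVD against randomized_svd with zero and with a few power iterations, over a small grid of matrix shapes.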
""" import gc from time import time import numpy as np from collections import defaultdict import six from scipy.linalg import svd from sklearn.utils.extmath import randomized_svd from sklearn.datasets.samples_generator import make_low_rank_matrix def compute_bench(samples_range, features_range, n_iter=3, rank=50): it = 0 results = defaultdict(lambda: []) max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') X = make_low_rank_matrix(n_samples, n_features, effective_rank=rank, tail_strength=0.2) gc.collect() print("benchmarking scipy svd: ") tstart = time() svd(X, full_matrices=False) results['scipy svd'].append(time() - tstart) gc.collect() print("benchmarking scikit-learn randomized_svd: n_iter=0") tstart = time() randomized_svd(X, rank, n_iter=0) results['scikit-learn randomized_svd (n_iter=0)'].append( time() - tstart) gc.collect() print("benchmarking scikit-learn randomized_svd: n_iter=%d " % n_iter) tstart = time() randomized_svd(X, rank, n_iter=n_iter) results['scikit-learn randomized_svd (n_iter=%d)' % n_iter].append(time() - tstart) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(2, 1000, 4).astype(np.int) features_range = np.linspace(2, 1000, 4).astype(np.int) results = compute_bench(samples_range, features_range) label = 'scikit-learn singular value decomposition benchmark results' fig = plt.figure(label) ax = fig.gca(projection='3d') for c, (label, timings) in zip('rbg', sorted(six.iteritems(results))): X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3, color=c) # dummy point plot to stick the legend to since surface plot do not # support legends (yet?) 
ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') ax.set_zlabel('Time (s)') ax.legend() plt.show() scikit-learn-0.19.1/benchmarks/bench_plot_ward.py000066400000000000000000000024031317344356400220210ustar00rootroot00000000000000""" Benchmark scikit-learn's Ward implementation compared to SciPy's """ import time import numpy as np from scipy.cluster import hierarchy import matplotlib.pyplot as plt from sklearn.cluster import AgglomerativeClustering ward = AgglomerativeClustering(n_clusters=3, linkage='ward') n_samples = np.logspace(.5, 3, 9) n_features = np.logspace(1, 3.5, 7) N_samples, N_features = np.meshgrid(n_samples, n_features) scikits_time = np.zeros(N_samples.shape) scipy_time = np.zeros(N_samples.shape) for i, n in enumerate(n_samples): for j, p in enumerate(n_features): X = np.random.normal(size=(n, p)) t0 = time.time() ward.fit(X) scikits_time[j, i] = time.time() - t0 t0 = time.time() hierarchy.ward(X) scipy_time[j, i] = time.time() - t0 ratio = scikits_time / scipy_time plt.figure("scikit-learn Ward's method benchmark results") plt.imshow(np.log(ratio), aspect='auto', origin="lower") plt.colorbar() plt.contour(ratio, levels=[1, ], colors='k') plt.yticks(range(len(n_features)), n_features.astype(np.int)) plt.ylabel('N features') plt.xticks(range(len(n_samples)), n_samples.astype(np.int)) plt.xlabel('N samples') plt.title("Scikit's time, in units of scipy time (log)") plt.show() scikit-learn-0.19.1/benchmarks/bench_random_projections.py000066400000000000000000000213041317344356400237260ustar00rootroot00000000000000""" =========================== Random projection benchmark =========================== Benchmarks for random projections. """ from __future__ import division from __future__ import print_function import gc import sys import optparse from datetime import datetime import collections import numpy as np import scipy.sparse as sp from sklearn import clone from sklearn.externals.six.moves import xrange from sklearn.random_projection import (SparseRandomProjection, GaussianRandomProjection, johnson_lindenstrauss_min_dim) def type_auto_or_float(val): if val == "auto": return "auto" else: return float(val) def type_auto_or_int(val): if val == "auto": return "auto" else: return int(val) def compute_time(t_start, delta): mu_second = 0.0 + 10 ** 6 # number of microseconds in a second return delta.seconds + delta.microseconds / mu_second def bench_scikit_transformer(X, transformer): gc.collect() clf = clone(transformer) # start time t_start = datetime.now() clf.fit(X) delta = (datetime.now() - t_start) # stop time time_to_fit = compute_time(t_start, delta) # start time t_start = datetime.now() clf.transform(X) delta = (datetime.now() - t_start) # stop time time_to_transform = compute_time(t_start, delta) return time_to_fit, time_to_transform # Make some random data with uniformly located non zero entries with # Gaussian distributed values def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=None): rng = np.random.RandomState(random_state) data_coo = sp.coo_matrix( (rng.randn(n_nonzeros), (rng.randint(n_samples, size=n_nonzeros), rng.randint(n_features, size=n_nonzeros))), shape=(n_samples, n_features)) return data_coo.toarray(), data_coo.tocsr() def print_row(clf_type, time_fit, time_transform): print("%s | %s | %s" % (clf_type.ljust(30), ("%.4fs" % time_fit).center(12), ("%.4fs" % time_transform).center(12)))
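# Two small sketches of quantities used below; both reflect assumptions about
# the scikit-learn 0.19 behaviour and are hypothetical helpers never called by
# the benchmark itself.
def auto_sparse_density(n_features):
    # SparseRandomProjection(density='auto') uses 1 / sqrt(n_features), the
    # very sparse scheme recommended by Ping Li et al.
    return 1. / np.sqrt(n_features)

def jl_bound_demo(n_samples=500):
    # The 'auto' n_components reported below comes from the
    # Johnson-Lindenstrauss lemma; the bound depends only on n_samples and the
    # distortion eps, not on n_features.
    for eps in (0.1, 0.5, 0.9):
        print(eps, johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=eps))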
if __name__ == "__main__": ########################################################################### # Option parser ########################################################################### op = optparse.OptionParser() op.add_option("--n-times", dest="n_times", default=5, type=int, help="Benchmark results are averaged over n_times experiments") op.add_option("--n-features", dest="n_features", default=10 ** 4, type=int, help="Number of features in the benchmarks") op.add_option("--n-components", dest="n_components", default="auto", help="Size of the random subspace." " ('auto' or int > 0)") op.add_option("--ratio-nonzeros", dest="ratio_nonzeros", default=10 ** -3, type=float, help="Ratio of non-zero entries used to generate the data") op.add_option("--n-samples", dest="n_samples", default=500, type=int, help="Number of samples in the benchmarks") op.add_option("--random-seed", dest="random_seed", default=13, type=int, help="Seed used by the random number generators.") op.add_option("--density", dest="density", default=1 / 3, help="Density used by the sparse random projection." " ('auto' or float (0.0, 1.0]") op.add_option("--eps", dest="eps", default=0.5, type=float, help="See the documentation of the underlying transformers.") op.add_option("--transformers", dest="selected_transformers", default='GaussianRandomProjection,SparseRandomProjection', type=str, help="Comma-separated list of transformers to benchmark. " "Default: %default. Available: " "GaussianRandomProjection,SparseRandomProjection") op.add_option("--dense", dest="dense", default=False, action="store_true", help="Set input space as a dense matrix.") (opts, args) = op.parse_args() if len(args) > 0: op.error("this script takes no arguments.") sys.exit(1) opts.n_components = type_auto_or_int(opts.n_components) opts.density = type_auto_or_float(opts.density) selected_transformers = opts.selected_transformers.split(',') ########################################################################### # Generate dataset ########################################################################### n_nonzeros = int(opts.ratio_nonzeros * opts.n_features) print('Dataset statistics') print("===========================") print('n_samples \t= %s' % opts.n_samples) print('n_features \t= %s' % opts.n_features) if opts.n_components == "auto": print('n_components \t= %s (auto)' % johnson_lindenstrauss_min_dim(n_samples=opts.n_samples, eps=opts.eps)) else: print('n_components \t= %s' % opts.n_components) print('n_elements \t= %s' % (opts.n_features * opts.n_samples)) print('n_nonzeros \t= %s per feature' % n_nonzeros) print('ratio_nonzeros \t= %s' % opts.ratio_nonzeros) print('') ########################################################################### # Set transformer input ########################################################################### transformers = {} ########################################################################### # Set GaussianRandomProjection input gaussian_matrix_params = { "n_components": opts.n_components, "random_state": opts.random_seed } transformers["GaussianRandomProjection"] = \ GaussianRandomProjection(**gaussian_matrix_params) ########################################################################### # Set SparseRandomProjection input sparse_matrix_params = { "n_components": opts.n_components, "random_state": opts.random_seed, "density": opts.density, "eps": opts.eps, } transformers["SparseRandomProjection"] = \ SparseRandomProjection(**sparse_matrix_params) ########################################################################### # Perform benchmark ###########################################################################
time_fit = collections.defaultdict(list) time_transform = collections.defaultdict(list) print('Benchmarks') print("===========================") print("Generate dataset benchmarks... ", end="") X_dense, X_sparse = make_sparse_random_data(opts.n_samples, opts.n_features, n_nonzeros, random_state=opts.random_seed) X = X_dense if opts.dense else X_sparse print("done") for name in selected_transformers: print("Perform benchmarks for %s..." % name) for iteration in xrange(opts.n_times): print("\titer %s..." % iteration, end="") time_to_fit, time_to_transform = bench_scikit_transformer(X_dense, transformers[name]) time_fit[name].append(time_to_fit) time_transform[name].append(time_to_transform) print("done") print("") ########################################################################### # Print results ########################################################################### print("Script arguments") print("===========================") arguments = vars(opts) print("%s \t | %s " % ("Arguments".ljust(16), "Value".center(12),)) print(25 * "-" + ("|" + "-" * 14) * 1) for key, value in arguments.items(): print("%s \t | %s " % (str(key).ljust(16), str(value).strip().center(12))) print("") print("Transformer performance:") print("===========================") print("Results are averaged over %s repetition(s)." % opts.n_times) print("") print("%s | %s | %s" % ("Transformer".ljust(30), "fit".center(12), "transform".center(12))) print(31 * "-" + ("|" + "-" * 14) * 2) for name in sorted(selected_transformers): print_row(name, np.mean(time_fit[name]), np.mean(time_transform[name])) print("") print("") scikit-learn-0.19.1/benchmarks/bench_rcv1_logreg_convergence.py000066400000000000000000000160751317344356400246300ustar00rootroot00000000000000# Authors: Tom Dupre la Tour # Olivier Grisel # # License: BSD 3 clause import matplotlib.pyplot as plt import numpy as np import gc import time from sklearn.externals.joblib import Memory from sklearn.linear_model import (LogisticRegression, SGDClassifier) from sklearn.datasets import fetch_rcv1 from sklearn.linear_model.sag import get_auto_step_size try: import lightning.classification as lightning_clf except ImportError: lightning_clf = None m = Memory(cachedir='.', verbose=0) # compute logistic loss def get_loss(w, intercept, myX, myy, C): n_samples = myX.shape[0] w = w.ravel() p = np.mean(np.log(1. + np.exp(-myy * (myX.dot(w) + intercept)))) print("%f + %f" % (p, w.dot(w) / 2. / C / n_samples)) p += w.dot(w) / 2. / C / n_samples return p # We use joblib to cache individual fits. Note that we do not pass the dataset # as argument as the hashing would be too slow, so we assume that the dataset # never changes. @m.cache() def bench_one(name, clf_type, clf_params, n_iter): clf = clf_type(**clf_params) try: clf.set_params(max_iter=n_iter, random_state=42) except: clf.set_params(n_iter=n_iter, random_state=42) st = time.time() clf.fit(X, y) end = time.time() try: C = 1.0 / clf.alpha / n_samples except: C = clf.C try: intercept = clf.intercept_ except: intercept = 0. 
train_loss = get_loss(clf.coef_, intercept, X, y, C) train_score = clf.score(X, y) test_score = clf.score(X_test, y_test) duration = end - st return train_loss, train_score, test_score, duration def bench(clfs): for (name, clf, iter_range, train_losses, train_scores, test_scores, durations) in clfs: print("training %s" % name) clf_type = type(clf) clf_params = clf.get_params() for n_iter in iter_range: gc.collect() train_loss, train_score, test_score, duration = bench_one( name, clf_type, clf_params, n_iter) train_losses.append(train_loss) train_scores.append(train_score) test_scores.append(test_score) durations.append(duration) print("classifier: %s" % name) print("train_loss: %.8f" % train_loss) print("train_score: %.8f" % train_score) print("test_score: %.8f" % test_score) print("time for fit: %.8f seconds" % duration) print("") print("") return clfs def plot_train_losses(clfs): plt.figure() for (name, _, _, train_losses, _, _, durations) in clfs: plt.plot(durations, train_losses, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("train loss") def plot_train_scores(clfs): plt.figure() for (name, _, _, _, train_scores, _, durations) in clfs: plt.plot(durations, train_scores, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("train score") plt.ylim((0.92, 0.96)) def plot_test_scores(clfs): plt.figure() for (name, _, _, _, _, test_scores, durations) in clfs: plt.plot(durations, test_scores, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("test score") plt.ylim((0.92, 0.96)) def plot_dloss(clfs): plt.figure() pobj_final = [] for (name, _, _, train_losses, _, _, durations) in clfs: pobj_final.append(train_losses[-1]) indices = np.argsort(pobj_final) pobj_best = pobj_final[indices[0]] for (name, _, _, train_losses, _, _, durations) in clfs: log_pobj = np.log(abs(np.array(train_losses) - pobj_best)) / np.log(10) plt.plot(durations, log_pobj, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("log(best - train_loss)") def get_max_squared_sum(X): """Get the maximum row-wise sum of squares""" return np.sum(X ** 2, axis=1).max() rcv1 = fetch_rcv1() X = rcv1.data n_samples, n_features = X.shape # consider the binary classification problem 'CCAT' vs the rest ccat_idx = rcv1.target_names.tolist().index('CCAT') y = rcv1.target.tocsc()[:, ccat_idx].toarray().ravel().astype(np.float64) y[y == 0] = -1 # parameters C = 1. 
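# The estimators below parameterize the same L2 penalty differently:
#   LogisticRegression minimizes  sum_i log(1 + exp(-y_i * w.x_i)) + ||w||^2 / (2 * C)
#   SGDClassifier minimizes       mean_i log(1 + exp(-y_i * w.x_i)) + alpha * ||w||^2 / 2
# so the two objectives agree (up to the constant factor n_samples) when
# alpha = 1 / (C * n_samples), the scaling used in `clfs` below and in
# get_loss above. A small helper making the mapping explicit (not used
# elsewhere in this script):
def alpha_from_C(C, n_samples):
    return 1. / (C * n_samples)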
fit_intercept = True tol = 1.0e-14 # max_iter range sgd_iter_range = list(range(1, 121, 10)) newton_iter_range = list(range(1, 25, 3)) lbfgs_iter_range = list(range(1, 242, 12)) liblinear_iter_range = list(range(1, 37, 3)) liblinear_dual_iter_range = list(range(1, 85, 6)) sag_iter_range = list(range(1, 37, 3)) clfs = [ ("LR-liblinear", LogisticRegression(C=C, tol=tol, solver="liblinear", fit_intercept=fit_intercept, intercept_scaling=1), liblinear_iter_range, [], [], [], []), ("LR-liblinear-dual", LogisticRegression(C=C, tol=tol, dual=True, solver="liblinear", fit_intercept=fit_intercept, intercept_scaling=1), liblinear_dual_iter_range, [], [], [], []), ("LR-SAG", LogisticRegression(C=C, tol=tol, solver="sag", fit_intercept=fit_intercept), sag_iter_range, [], [], [], []), ("LR-newton-cg", LogisticRegression(C=C, tol=tol, solver="newton-cg", fit_intercept=fit_intercept), newton_iter_range, [], [], [], []), ("LR-lbfgs", LogisticRegression(C=C, tol=tol, solver="lbfgs", fit_intercept=fit_intercept), lbfgs_iter_range, [], [], [], []), ("SGD", SGDClassifier(alpha=1.0 / C / n_samples, penalty='l2', loss='log', fit_intercept=fit_intercept, verbose=0), sgd_iter_range, [], [], [], [])] if lightning_clf is not None and not fit_intercept: alpha = 1. / C / n_samples # compute the same step_size than in LR-sag max_squared_sum = get_max_squared_sum(X) step_size = get_auto_step_size(max_squared_sum, alpha, "log", fit_intercept) clfs.append( ("Lightning-SVRG", lightning_clf.SVRGClassifier(alpha=alpha, eta=step_size, tol=tol, loss="log"), sag_iter_range, [], [], [], [])) clfs.append( ("Lightning-SAG", lightning_clf.SAGClassifier(alpha=alpha, eta=step_size, tol=tol, loss="log"), sag_iter_range, [], [], [], [])) # We keep only 200 features, to have a dense dataset, # and compare to lightning SAG, which seems incorrect in the sparse case. X_csc = X.tocsc() nnz_in_each_features = X_csc.indptr[1:] - X_csc.indptr[:-1] X = X_csc[:, np.argsort(nnz_in_each_features)[-200:]] X = X.toarray() print("dataset: %.3f MB" % (X.nbytes / 1e6)) # Split training and testing. Switch train and test subset compared to # LYRL2004 split, to have a larger training dataset. n = 23149 X_test = X[:n, :] y_test = y[:n] X = X[n:, :] y = y[n:] clfs = bench(clfs) plot_train_scores(clfs) plot_test_scores(clfs) plot_train_losses(clfs) plot_dloss(clfs) plt.show() scikit-learn-0.19.1/benchmarks/bench_saga.py000066400000000000000000000204321317344356400207430ustar00rootroot00000000000000"""Author: Arthur Mensch Benchmarks of sklearn SAGA vs lightning SAGA vs Liblinear. Shows the gain in using multinomial logistic regression in term of learning time. """ import json import time from os.path import expanduser import matplotlib.pyplot as plt import numpy as np from sklearn.datasets import fetch_rcv1, load_iris, load_digits, \ fetch_20newsgroups_vectorized from sklearn.externals.joblib import delayed, Parallel, Memory from sklearn.linear_model import LogisticRegression from sklearn.metrics import log_loss from sklearn.model_selection import train_test_split from sklearn.preprocessing import LabelBinarizer, LabelEncoder from sklearn.utils.extmath import safe_sparse_dot, softmax def fit_single(solver, X, y, penalty='l2', single_target=True, C=1, max_iter=10, skip_slow=False): if skip_slow and solver == 'lightning' and penalty == 'l1': print('skip_slowping l1 logistic regression with solver lightning.') return print('Solving %s logistic regression with penalty %s, solver %s.' 
% ('binary' if single_target else 'multinomial', penalty, solver)) if solver == 'lightning': from lightning.classification import SAGAClassifier if single_target or solver not in ['sag', 'saga']: multi_class = 'ovr' else: multi_class = 'multinomial' X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y) n_samples = X_train.shape[0] n_classes = np.unique(y_train).shape[0] test_scores = [1] train_scores = [1] accuracies = [1 / n_classes] times = [0] if penalty == 'l2': alpha = 1. / (C * n_samples) beta = 0 lightning_penalty = None else: alpha = 0. beta = 1. / (C * n_samples) lightning_penalty = 'l1' for this_max_iter in range(1, max_iter + 1, 2): print('[%s, %s, %s] Max iter: %s' % ('binary' if single_target else 'multinomial', penalty, solver, this_max_iter)) if solver == 'lightning': lr = SAGAClassifier(loss='log', alpha=alpha, beta=beta, penalty=lightning_penalty, tol=-1, max_iter=this_max_iter) else: lr = LogisticRegression(solver=solver, multi_class=multi_class, C=C, penalty=penalty, fit_intercept=False, tol=1e-24, max_iter=this_max_iter, random_state=42, ) t0 = time.clock() lr.fit(X_train, y_train) train_time = time.clock() - t0 scores = [] for (X, y) in [(X_train, y_train), (X_test, y_test)]: try: y_pred = lr.predict_proba(X) except NotImplementedError: # Lightning predict_proba is not implemented for n_classes > 2 y_pred = _predict_proba(lr, X) score = log_loss(y, y_pred, normalize=False) / n_samples score += (0.5 * alpha * np.sum(lr.coef_ ** 2) + beta * np.sum(np.abs(lr.coef_))) scores.append(score) train_score, test_score = tuple(scores) y_pred = lr.predict(X_test) accuracy = np.sum(y_pred == y_test) / y_test.shape[0] test_scores.append(test_score) train_scores.append(train_score) accuracies.append(accuracy) times.append(train_time) return lr, times, train_scores, test_scores, accuracies def _predict_proba(lr, X): pred = safe_sparse_dot(X, lr.coef_.T) if hasattr(lr, "intercept_"): pred += lr.intercept_ return softmax(pred) def exp(solvers, penalties, single_target, n_samples=30000, max_iter=20, dataset='rcv1', n_jobs=1, skip_slow=False): mem = Memory(cachedir=expanduser('~/cache'), verbose=0) if dataset == 'rcv1': rcv1 = fetch_rcv1() lbin = LabelBinarizer() lbin.fit(rcv1.target_names) X = rcv1.data y = rcv1.target y = lbin.inverse_transform(y) le = LabelEncoder() y = le.fit_transform(y) if single_target: y_n = y.copy() y_n[y > 16] = 1 y_n[y <= 16] = 0 y = y_n elif dataset == 'digits': digits = load_digits() X, y = digits.data, digits.target if single_target: y_n = y.copy() y_n[y < 5] = 1 y_n[y >= 5] = 0 y = y_n elif dataset == 'iris': iris = load_iris() X, y = iris.data, iris.target elif dataset == '20newspaper': ng = fetch_20newsgroups_vectorized() X = ng.data y = ng.target if single_target: y_n = y.copy() y_n[y > 4] = 1 y_n[y <= 16] = 0 y = y_n X = X[:n_samples] y = y[:n_samples] cached_fit = mem.cache(fit_single) out = Parallel(n_jobs=n_jobs, mmap_mode=None)( delayed(cached_fit)(solver, X, y, penalty=penalty, single_target=single_target, C=1, max_iter=max_iter, skip_slow=skip_slow) for solver in solvers for penalty in penalties) res = [] idx = 0 for solver in solvers: for penalty in penalties: if not (skip_slow and solver == 'lightning' and penalty == 'l1'): lr, times, train_scores, test_scores, accuracies = out[idx] this_res = dict(solver=solver, penalty=penalty, single_target=single_target, times=times, train_scores=train_scores, test_scores=test_scores, accuracies=accuracies) res.append(this_res) idx += 1 with open('bench_saga.json', 'w+') as 
f: json.dump(res, f) def plot(): import pandas as pd with open('bench_saga.json', 'r') as f: f = json.load(f) res = pd.DataFrame(f) res.set_index(['single_target', 'penalty'], inplace=True) grouped = res.groupby(level=['single_target', 'penalty']) colors = {'saga': 'blue', 'liblinear': 'orange', 'lightning': 'green'} for idx, group in grouped: single_target, penalty = idx fig = plt.figure(figsize=(12, 4)) ax = fig.add_subplot(131) train_scores = group['train_scores'].values ref = np.min(np.concatenate(train_scores)) * 0.999 for scores, times, solver in zip(group['train_scores'], group['times'], group['solver']): scores = scores / ref - 1 ax.plot(times, scores, label=solver, color=colors[solver]) ax.set_xlabel('Time (s)') ax.set_ylabel('Training objective (relative to min)') ax.set_yscale('log') ax = fig.add_subplot(132) test_scores = group['test_scores'].values ref = np.min(np.concatenate(test_scores)) * 0.999 for scores, times, solver in zip(group['test_scores'], group['times'], group['solver']): scores = scores / ref - 1 ax.plot(times, scores, label=solver, color=colors[solver]) ax.set_xlabel('Time (s)') ax.set_ylabel('Test objective (relative to min)') ax.set_yscale('log') ax = fig.add_subplot(133) for accuracy, times, solver in zip(group['accuracies'], group['times'], group['solver']): ax.plot(times, accuracy, label=solver, color=colors[solver]) ax.set_xlabel('Time (s)') ax.set_ylabel('Test accuracy') ax.legend() name = 'single_target' if single_target else 'multi_target' name += '_%s' % penalty plt.suptitle(name) name += '.png' fig.tight_layout() fig.subplots_adjust(top=0.9) plt.savefig(name) plt.close(fig) if __name__ == '__main__': solvers = ['saga', 'liblinear', 'lightning'] penalties = ['l1', 'l2'] single_target = True exp(solvers, penalties, single_target, n_samples=None, n_jobs=1, dataset='20newspaper', max_iter=20) plot() scikit-learn-0.19.1/benchmarks/bench_sample_without_replacement.py000066400000000000000000000175101317344356400254560ustar00rootroot00000000000000""" Benchmarks for sampling without replacement of integer. 
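It compares several strategies (tracking selection, reservoir sampling, pool-based sampling, Python's built-in random.sample and a numpy permutation) as the ratio of sampled items to the population size grows.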
""" from __future__ import division from __future__ import print_function import gc import sys import optparse from datetime import datetime import operator import matplotlib.pyplot as plt import numpy as np import random from sklearn.externals.six.moves import xrange from sklearn.utils.random import sample_without_replacement def compute_time(t_start, delta): mu_second = 0.0 + 10 ** 6 # number of microseconds in a second return delta.seconds + delta.microseconds / mu_second def bench_sample(sampling, n_population, n_samples): gc.collect() # start time t_start = datetime.now() sampling(n_population, n_samples) delta = (datetime.now() - t_start) # stop time time = compute_time(t_start, delta) return time if __name__ == "__main__": ########################################################################### # Option parser ########################################################################### op = optparse.OptionParser() op.add_option("--n-times", dest="n_times", default=5, type=int, help="Benchmark results are average over n_times experiments") op.add_option("--n-population", dest="n_population", default=100000, type=int, help="Size of the population to sample from.") op.add_option("--n-step", dest="n_steps", default=5, type=int, help="Number of step interval between 0 and n_population.") default_algorithms = "custom-tracking-selection,custom-auto," \ "custom-reservoir-sampling,custom-pool,"\ "python-core-sample,numpy-permutation" op.add_option("--algorithm", dest="selected_algorithm", default=default_algorithms, type=str, help="Comma-separated list of transformer to benchmark. " "Default: %default. \nAvailable: %default") # op.add_option("--random-seed", # dest="random_seed", default=13, type=int, # help="Seed used by the random number generators.") (opts, args) = op.parse_args() if len(args) > 0: op.error("this script takes no arguments.") sys.exit(1) selected_algorithm = opts.selected_algorithm.split(',') for key in selected_algorithm: if key not in default_algorithms.split(','): raise ValueError("Unknown sampling algorithm \"%s\" not in (%s)." 
% (key, default_algorithms)) ########################################################################### # List sampling algorithm ########################################################################### # We assume that sampling algorithm has the following signature: # sample(n_population, n_sample) # sampling_algorithm = {} ########################################################################### # Set Python core input sampling_algorithm["python-core-sample"] = \ lambda n_population, n_sample: \ random.sample(xrange(n_population), n_sample) ########################################################################### # Set custom automatic method selection sampling_algorithm["custom-auto"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="auto", random_state=random_state) ########################################################################### # Set custom tracking based method sampling_algorithm["custom-tracking-selection"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="tracking_selection", random_state=random_state) ########################################################################### # Set custom reservoir based method sampling_algorithm["custom-reservoir-sampling"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="reservoir_sampling", random_state=random_state) ########################################################################### # Set custom reservoir based method sampling_algorithm["custom-pool"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="pool", random_state=random_state) ########################################################################### # Numpy permutation based sampling_algorithm["numpy-permutation"] = \ lambda n_population, n_sample: \ np.random.permutation(n_population)[:n_sample] ########################################################################### # Remove unspecified algorithm sampling_algorithm = dict((key, value) for key, value in sampling_algorithm.items() if key in selected_algorithm) ########################################################################### # Perform benchmark ########################################################################### time = {} n_samples = np.linspace(start=0, stop=opts.n_population, num=opts.n_steps).astype(np.int) ratio = n_samples / opts.n_population print('Benchmarks') print("===========================") for name in sorted(sampling_algorithm): print("Perform benchmarks for %s..." 
% name, end="") time[name] = np.zeros(shape=(opts.n_steps, opts.n_times)) for step in xrange(opts.n_steps): for it in xrange(opts.n_times): time[name][step, it] = bench_sample(sampling_algorithm[name], opts.n_population, n_samples[step]) print("done") print("Averaging results...", end="") for name in sampling_algorithm: time[name] = np.mean(time[name], axis=1) print("done\n") # Print results ########################################################################### print("Script arguments") print("===========================") arguments = vars(opts) print("%s \t | %s " % ("Arguments".ljust(16), "Value".center(12),)) print(25 * "-" + ("|" + "-" * 14) * 1) for key, value in arguments.items(): print("%s \t | %s " % (str(key).ljust(16), str(value).strip().center(12))) print("") print("Sampling algorithm performance:") print("===============================") print("Results are averaged over %s repetition(s)." % opts.n_times) print("") fig = plt.figure('scikit-learn sample w/o replacement benchmark results') plt.title("n_population = %s, n_times = %s" % (opts.n_population, opts.n_times)) ax = fig.add_subplot(111) for name in sampling_algorithm: ax.plot(ratio, time[name], label=name) ax.set_xlabel('ratio of n_sample / n_population') ax.set_ylabel('Time (s)') ax.legend() # Sort legend labels handles, labels = ax.get_legend_handles_labels() hl = sorted(zip(handles, labels), key=operator.itemgetter(1)) handles2, labels2 = zip(*hl) ax.legend(handles2, labels2, loc=0) plt.show() scikit-learn-0.19.1/benchmarks/bench_sgd_regression.py000066400000000000000000000127011317344356400230450ustar00rootroot00000000000000# Author: Peter Prettenhofer # License: BSD 3 clause import numpy as np import matplotlib.pyplot as plt import gc from time import time from sklearn.linear_model import Ridge, SGDRegressor, ElasticNet from sklearn.metrics import mean_squared_error from sklearn.datasets.samples_generator import make_regression """ Benchmark for SGD regression Compares SGD regression against coordinate descent and Ridge on synthetic data. 
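Each estimator is timed and scored (RMSE on held-out data) over a grid of training set sizes and feature counts.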
""" print(__doc__) if __name__ == "__main__": list_n_samples = np.linspace(100, 10000, 5).astype(np.int) list_n_features = [10, 100, 1000] n_test = 1000 max_iter = 1000 noise = 0.1 alpha = 0.01 sgd_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) elnet_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) ridge_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) asgd_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) for i, n_train in enumerate(list_n_samples): for j, n_features in enumerate(list_n_features): X, y, coef = make_regression( n_samples=n_train + n_test, n_features=n_features, noise=noise, coef=True) X_train = X[:n_train] y_train = y[:n_train] X_test = X[n_train:] y_test = y[n_train:] print("=======================") print("Round %d %d" % (i, j)) print("n_features:", n_features) print("n_samples:", n_train) # Shuffle data idx = np.arange(n_train) np.random.seed(13) np.random.shuffle(idx) X_train = X_train[idx] y_train = y_train[idx] std = X_train.std(axis=0) mean = X_train.mean(axis=0) X_train = (X_train - mean) / std X_test = (X_test - mean) / std std = y_train.std(axis=0) mean = y_train.mean(axis=0) y_train = (y_train - mean) / std y_test = (y_test - mean) / std gc.collect() print("- benchmarking ElasticNet") clf = ElasticNet(alpha=alpha, l1_ratio=0.5, fit_intercept=False) tstart = time() clf.fit(X_train, y_train) elnet_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) elnet_results[i, j, 1] = time() - tstart gc.collect() print("- benchmarking SGD") clf = SGDRegressor(alpha=alpha / n_train, fit_intercept=False, max_iter=max_iter, learning_rate="invscaling", eta0=.01, power_t=0.25, tol=1e-3) tstart = time() clf.fit(X_train, y_train) sgd_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) sgd_results[i, j, 1] = time() - tstart gc.collect() print("max_iter", max_iter) print("- benchmarking A-SGD") clf = SGDRegressor(alpha=alpha / n_train, fit_intercept=False, max_iter=max_iter, learning_rate="invscaling", eta0=.002, power_t=0.05, tol=1e-3, average=(max_iter * n_train // 2)) tstart = time() clf.fit(X_train, y_train) asgd_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) asgd_results[i, j, 1] = time() - tstart gc.collect() print("- benchmarking RidgeRegression") clf = Ridge(alpha=alpha, fit_intercept=False) tstart = time() clf.fit(X_train, y_train) ridge_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) ridge_results[i, j, 1] = time() - tstart # Plot results i = 0 m = len(list_n_features) plt.figure('scikit-learn SGD regression benchmark results', figsize=(5 * 2, 4 * m)) for j in range(m): plt.subplot(m, 2, i + 1) plt.plot(list_n_samples, np.sqrt(elnet_results[:, j, 0]), label="ElasticNet") plt.plot(list_n_samples, np.sqrt(sgd_results[:, j, 0]), label="SGDRegressor") plt.plot(list_n_samples, np.sqrt(asgd_results[:, j, 0]), label="A-SGDRegressor") plt.plot(list_n_samples, np.sqrt(ridge_results[:, j, 0]), label="Ridge") plt.legend(prop={"size": 10}) plt.xlabel("n_train") plt.ylabel("RMSE") plt.title("Test error - %d features" % list_n_features[j]) i += 1 plt.subplot(m, 2, i + 1) plt.plot(list_n_samples, np.sqrt(elnet_results[:, j, 1]), label="ElasticNet") plt.plot(list_n_samples, np.sqrt(sgd_results[:, j, 1]), label="SGDRegressor") plt.plot(list_n_samples, np.sqrt(asgd_results[:, j, 1]), label="A-SGDRegressor") plt.plot(list_n_samples, np.sqrt(ridge_results[:, j, 1]), label="Ridge") plt.legend(prop={"size": 10}) plt.xlabel("n_train") plt.ylabel("Time 
[sec]") plt.title("Training time - %d features" % list_n_features[j]) i += 1 plt.subplots_adjust(hspace=.30) plt.show() scikit-learn-0.19.1/benchmarks/bench_sparsify.py000066400000000000000000000065221317344356400216740ustar00rootroot00000000000000""" Benchmark SGD prediction time with dense/sparse coefficients. Invoke with ----------- $ kernprof.py -l sparsity_benchmark.py $ python -m line_profiler sparsity_benchmark.py.lprof Typical output -------------- input data sparsity: 0.050000 true coef sparsity: 0.000100 test data sparsity: 0.027400 model sparsity: 0.000024 r^2 on test data (dense model) : 0.233651 r^2 on test data (sparse model) : 0.233651 Wrote profile results to sparsity_benchmark.py.lprof Timer unit: 1e-06 s File: sparsity_benchmark.py Function: benchmark_dense_predict at line 51 Total time: 0.532979 s Line # Hits Time Per Hit % Time Line Contents ============================================================== 51 @profile 52 def benchmark_dense_predict(): 53 301 640 2.1 0.1 for _ in range(300): 54 300 532339 1774.5 99.9 clf.predict(X_test) File: sparsity_benchmark.py Function: benchmark_sparse_predict at line 56 Total time: 0.39274 s Line # Hits Time Per Hit % Time Line Contents ============================================================== 56 @profile 57 def benchmark_sparse_predict(): 58 1 10854 10854.0 2.8 X_test_sparse = csr_matrix(X_test) 59 301 477 1.6 0.1 for _ in range(300): 60 300 381409 1271.4 97.1 clf.predict(X_test_sparse) """ from scipy.sparse.csr import csr_matrix import numpy as np from sklearn.linear_model.stochastic_gradient import SGDRegressor from sklearn.metrics import r2_score np.random.seed(42) def sparsity_ratio(X): return np.count_nonzero(X) / float(n_samples * n_features) n_samples, n_features = 5000, 300 X = np.random.randn(n_samples, n_features) inds = np.arange(n_samples) np.random.shuffle(inds) X[inds[int(n_features / 1.2):]] = 0 # sparsify input print("input data sparsity: %f" % sparsity_ratio(X)) coef = 3 * np.random.randn(n_features) inds = np.arange(n_features) np.random.shuffle(inds) coef[inds[n_features // 2:]] = 0 # sparsify coef print("true coef sparsity: %f" % sparsity_ratio(coef)) y = np.dot(X, coef) # add noise y += 0.01 * np.random.normal((n_samples,)) # Split data in train set and test set n_samples = X.shape[0] X_train, y_train = X[:n_samples // 2], y[:n_samples // 2] X_test, y_test = X[n_samples // 2:], y[n_samples // 2:] print("test data sparsity: %f" % sparsity_ratio(X_test)) ############################################################################### clf = SGDRegressor(penalty='l1', alpha=.2, fit_intercept=True, max_iter=2000, tol=None) clf.fit(X_train, y_train) print("model sparsity: %f" % sparsity_ratio(clf.coef_)) def benchmark_dense_predict(): for _ in range(300): clf.predict(X_test) def benchmark_sparse_predict(): X_test_sparse = csr_matrix(X_test) for _ in range(300): clf.predict(X_test_sparse) def score(y_test, y_pred, case): r2 = r2_score(y_test, y_pred) print("r^2 on test data (%s) : %f" % (case, r2)) score(y_test, clf.predict(X_test), 'dense model') benchmark_dense_predict() clf.sparsify() score(y_test, clf.predict(X_test), 'sparse model') benchmark_sparse_predict() scikit-learn-0.19.1/benchmarks/bench_text_vectorizers.py000066400000000000000000000041001317344356400234450ustar00rootroot00000000000000""" To run this benchmark, you will need, * scikit-learn * pandas * memory_profiler * psutil (optional, but recommended) """ from __future__ import print_function import timeit import itertools import numpy as np import 
pandas as pd from memory_profiler import memory_usage from sklearn.datasets import fetch_20newsgroups from sklearn.feature_extraction.text import (CountVectorizer, TfidfVectorizer, HashingVectorizer) n_repeat = 3 def run_vectorizer(Vectorizer, X, **params): def f(): vect = Vectorizer(**params) vect.fit_transform(X) return f text = fetch_20newsgroups(subset='train').data print("="*80 + '\n#' + " Text vectorizers benchmark" + '\n' + '='*80 + '\n') print("Using a subset of the 20 newsrgoups dataset ({} documents)." .format(len(text))) print("This benchmarks runs in ~20 min ...") res = [] for Vectorizer, (analyzer, ngram_range) in itertools.product( [CountVectorizer, TfidfVectorizer, HashingVectorizer], [('word', (1, 1)), ('word', (1, 2)), ('word', (1, 4)), ('char', (4, 4)), ('char_wb', (4, 4)) ]): bench = {'vectorizer': Vectorizer.__name__} params = {'analyzer': analyzer, 'ngram_range': ngram_range} bench.update(params) dt = timeit.repeat(run_vectorizer(Vectorizer, text, **params), number=1, repeat=n_repeat) bench['time'] = "{:.2f} (+-{:.2f})".format(np.mean(dt), np.std(dt)) mem_usage = memory_usage(run_vectorizer(Vectorizer, text, **params)) bench['memory'] = "{:.1f}".format(np.max(mem_usage)) res.append(bench) df = pd.DataFrame(res).set_index(['analyzer', 'ngram_range', 'vectorizer']) print('\n========== Run time performance (sec) ===========\n') print('Computing the mean and the standard deviation ' 'of the run time over {} runs...\n'.format(n_repeat)) print(df['time'].unstack(level=-1)) print('\n=============== Memory usage (MB) ===============\n') print(df['memory'].unstack(level=-1)) scikit-learn-0.19.1/benchmarks/bench_tree.py000066400000000000000000000070771317344356400210010ustar00rootroot00000000000000""" To run this, you'll need to have installed. * scikit-learn Does two benchmarks First, we fix a training set, increase the number of samples to classify and plot number of classified samples as a function of time. In the second benchmark, we increase the number of dimensions of the training set, classify a sample and plot the time taken as a function of the number of dimensions. 
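Both a decision tree classifier and a decision tree regressor are fitted on random dense data in each benchmark.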
""" import numpy as np import matplotlib.pyplot as plt import gc from datetime import datetime # to store the results scikit_classifier_results = [] scikit_regressor_results = [] mu_second = 0.0 + 10 ** 6 # number of microseconds in a second def bench_scikit_tree_classifier(X, Y): """Benchmark with scikit-learn decision tree classifier""" from sklearn.tree import DecisionTreeClassifier gc.collect() # start time tstart = datetime.now() clf = DecisionTreeClassifier() clf.fit(X, Y).predict(X) delta = (datetime.now() - tstart) # stop time scikit_classifier_results.append( delta.seconds + delta.microseconds / mu_second) def bench_scikit_tree_regressor(X, Y): """Benchmark with scikit-learn decision tree regressor""" from sklearn.tree import DecisionTreeRegressor gc.collect() # start time tstart = datetime.now() clf = DecisionTreeRegressor() clf.fit(X, Y).predict(X) delta = (datetime.now() - tstart) # stop time scikit_regressor_results.append( delta.seconds + delta.microseconds / mu_second) if __name__ == '__main__': print('============================================') print('Warning: this is going to take a looong time') print('============================================') n = 10 step = 10000 n_samples = 10000 dim = 10 n_classes = 10 for i in range(n): print('============================================') print('Entering iteration %s of %s' % (i, n)) print('============================================') n_samples += step X = np.random.randn(n_samples, dim) Y = np.random.randint(0, n_classes, (n_samples,)) bench_scikit_tree_classifier(X, Y) Y = np.random.randn(n_samples) bench_scikit_tree_regressor(X, Y) xx = range(0, n * step, step) plt.figure('scikit-learn tree benchmark results') plt.subplot(211) plt.title('Learning with varying number of samples') plt.plot(xx, scikit_classifier_results, 'g-', label='classification') plt.plot(xx, scikit_regressor_results, 'r-', label='regression') plt.legend(loc='upper left') plt.xlabel('number of samples') plt.ylabel('Time (s)') scikit_classifier_results = [] scikit_regressor_results = [] n = 10 step = 500 start_dim = 500 n_classes = 10 dim = start_dim for i in range(0, n): print('============================================') print('Entering iteration %s of %s' % (i, n)) print('============================================') dim += step X = np.random.randn(100, dim) Y = np.random.randint(0, n_classes, (100,)) bench_scikit_tree_classifier(X, Y) Y = np.random.randn(100) bench_scikit_tree_regressor(X, Y) xx = np.arange(start_dim, start_dim + n * step, step) plt.subplot(212) plt.title('Learning in high dimensional spaces') plt.plot(xx, scikit_classifier_results, 'g-', label='classification') plt.plot(xx, scikit_regressor_results, 'r-', label='regression') plt.legend(loc='upper left') plt.xlabel('number of dimensions') plt.ylabel('Time (s)') plt.axis('tight') plt.show() scikit-learn-0.19.1/benchmarks/bench_tsne_mnist.py000066400000000000000000000137001317344356400222130ustar00rootroot00000000000000""" ============================= MNIST dataset T-SNE benchmark ============================= """ from __future__ import division, print_function # License: BSD 3 clause import os import os.path as op from time import time import numpy as np import json import argparse from sklearn.externals.joblib import Memory from sklearn.datasets import fetch_mldata from sklearn.manifold import TSNE from sklearn.neighbors import NearestNeighbors from sklearn.decomposition import PCA from sklearn.utils import check_array from sklearn.utils import shuffle as _shuffle LOG_DIR = 
"mnist_tsne_output" if not os.path.exists(LOG_DIR): os.mkdir(LOG_DIR) memory = Memory(os.path.join(LOG_DIR, 'mnist_tsne_benchmark_data'), mmap_mode='r') @memory.cache def load_data(dtype=np.float32, order='C', shuffle=True, seed=0): """Load the data, then cache and memmap the train/test split""" print("Loading dataset...") data = fetch_mldata('MNIST original') X = check_array(data['data'], dtype=dtype, order=order) y = data["target"] if shuffle: X, y = _shuffle(X, y, random_state=seed) # Normalize features X /= 255 return X, y def nn_accuracy(X, X_embedded, k=1): """Accuracy of the first nearest neighbor""" knn = NearestNeighbors(n_neighbors=1, n_jobs=-1) _, neighbors_X = knn.fit(X).kneighbors() _, neighbors_X_embedded = knn.fit(X_embedded).kneighbors() return np.mean(neighbors_X == neighbors_X_embedded) def tsne_fit_transform(model, data): transformed = model.fit_transform(data) return transformed, model.n_iter_ def sanitize(filename): return filename.replace("/", '-').replace(" ", "_") if __name__ == "__main__": parser = argparse.ArgumentParser('Benchmark for t-SNE') parser.add_argument('--order', type=str, default='C', help='Order of the input data') parser.add_argument('--perplexity', type=float, default=30) parser.add_argument('--bhtsne', action='store_true', help="if set and the reference bhtsne code is " "correctly installed, run it in the benchmark.") parser.add_argument('--all', action='store_true', help="if set, run the benchmark with the whole MNIST." "dataset. Note that it will take up to 1 hour.") parser.add_argument('--profile', action='store_true', help="if set, run the benchmark with a memory " "profiler.") parser.add_argument('--verbose', type=int, default=0) parser.add_argument('--pca-components', type=int, default=50, help="Number of principal components for " "preprocessing.") args = parser.parse_args() X, y = load_data(order=args.order) if args.pca_components > 0: t0 = time() X = PCA(n_components=args.pca_components).fit_transform(X) print("PCA preprocessing down to {} dimensions took {:0.3f}s" .format(args.pca_components, time() - t0)) methods = [] # Put TSNE in methods tsne = TSNE(n_components=2, init='pca', perplexity=args.perplexity, verbose=args.verbose, n_iter=1000) methods.append(("sklearn TSNE", lambda data: tsne_fit_transform(tsne, data))) if args.bhtsne: try: from bhtsne.bhtsne import run_bh_tsne except ImportError: raise ImportError("""\ If you want comparison with the reference implementation, build the binary from source (https://github.com/lvdmaaten/bhtsne) in the folder benchmarks/bhtsne and add an empty `__init__.py` file in the folder: $ git clone git@github.com:lvdmaaten/bhtsne.git $ cd bhtsne $ g++ sptree.cpp tsne.cpp tsne_main.cpp -o bh_tsne -O2 $ touch __init__.py $ cd .. """) def bhtsne(X): """Wrapper for the reference lvdmaaten/bhtsne implementation.""" # PCA preprocessing is done elsewhere in the benchmark script n_iter = -1 # TODO find a way to report the number of iterations return run_bh_tsne(X, use_pca=False, perplexity=args.perplexity, verbose=args.verbose > 0), n_iter methods.append(("lvdmaaten/bhtsne", bhtsne)) if args.profile: try: from memory_profiler import profile except ImportError: raise ImportError("To run the benchmark with `--profile`, you " "need to install `memory_profiler`. 
Please " "run `pip install memory_profiler`.") methods = [(n, profile(m)) for n, m in methods] data_size = [100, 500, 1000, 5000, 10000] if args.all: data_size.append(70000) results = [] basename, _ = os.path.splitext(__file__) log_filename = os.path.join(LOG_DIR, basename + '.json') for n in data_size: X_train = X[:n] y_train = y[:n] n = X_train.shape[0] for name, method in methods: print("Fitting {} on {} samples...".format(name, n)) t0 = time() np.save(os.path.join(LOG_DIR, 'mnist_{}_{}.npy' .format('original', n)), X_train) np.save(os.path.join(LOG_DIR, 'mnist_{}_{}.npy' .format('original_labels', n)), y_train) X_embedded, n_iter = method(X_train) duration = time() - t0 precision_5 = nn_accuracy(X_train, X_embedded) print("Fitting {} on {} samples took {:.3f}s in {:d} iterations, " "nn accuracy: {:0.3f}".format( name, n, duration, n_iter, precision_5)) results.append(dict(method=name, duration=duration, n_samples=n)) with open(log_filename, 'w', encoding='utf-8') as f: json.dump(results, f) method_name = sanitize(name) np.save(op.join(LOG_DIR, 'mnist_{}_{}.npy'.format(method_name, n)), X_embedded) scikit-learn-0.19.1/benchmarks/plot_tsne_mnist.py000066400000000000000000000014761317344356400221210ustar00rootroot00000000000000import matplotlib.pyplot as plt import numpy as np import os.path as op import argparse LOG_DIR = "mnist_tsne_output" if __name__ == "__main__": parser = argparse.ArgumentParser('Plot benchmark results for t-SNE') parser.add_argument( '--labels', type=str, default=op.join(LOG_DIR, 'mnist_original_labels_10000.npy'), help='1D integer numpy array for labels') parser.add_argument( '--embedding', type=str, default=op.join(LOG_DIR, 'mnist_sklearn_TSNE_10000.npy'), help='2D float numpy array for embedded data') args = parser.parse_args() X = np.load(args.embedding) y = np.load(args.labels) for i in np.unique(y): mask = y == i plt.scatter(X[mask, 0], X[mask, 1], alpha=0.2, label=int(i)) plt.legend(loc='best') plt.show() scikit-learn-0.19.1/build_tools/000077500000000000000000000000001317344356400165205ustar00rootroot00000000000000scikit-learn-0.19.1/build_tools/appveyor/000077500000000000000000000000001317344356400203655ustar00rootroot00000000000000scikit-learn-0.19.1/build_tools/appveyor/install.ps1000066400000000000000000000160331317344356400224630ustar00rootroot00000000000000# Sample script to install Python and pip under Windows # Authors: Olivier Grisel, Jonathan Helmus, Kyle Kastner, and Alex Willmer # License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/ $MINICONDA_URL = "http://repo.continuum.io/miniconda/" $BASE_URL = "https://www.python.org/ftp/python/" $GET_PIP_URL = "https://bootstrap.pypa.io/get-pip.py" $GET_PIP_PATH = "C:\get-pip.py" $PYTHON_PRERELEASE_REGEX = @" (?x) (?\d+) \. (?\d+) \. (?\d+) (?[a-z]{1,2}\d+) "@ function Download ($filename, $url) { $webclient = New-Object System.Net.WebClient $basedir = $pwd.Path + "\" $filepath = $basedir + $filename if (Test-Path $filename) { Write-Host "Reusing" $filepath return $filepath } # Download and retry up to 3 times in case of network transient errors. 
Write-Host "Downloading" $filename "from" $url $retry_attempts = 2 for ($i = 0; $i -lt $retry_attempts; $i++) { try { $webclient.DownloadFile($url, $filepath) break } Catch [Exception]{ Start-Sleep 1 } } if (Test-Path $filepath) { Write-Host "File saved at" $filepath } else { # Retry once to get the error message if any at the last try $webclient.DownloadFile($url, $filepath) } return $filepath } function ParsePythonVersion ($python_version) { if ($python_version -match $PYTHON_PRERELEASE_REGEX) { return ([int]$matches.major, [int]$matches.minor, [int]$matches.micro, $matches.prerelease) } $version_obj = [version]$python_version return ($version_obj.major, $version_obj.minor, $version_obj.build, "") } function DownloadPython ($python_version, $platform_suffix) { $major, $minor, $micro, $prerelease = ParsePythonVersion $python_version if (($major -le 2 -and $micro -eq 0) ` -or ($major -eq 3 -and $minor -le 2 -and $micro -eq 0) ` ) { $dir = "$major.$minor" $python_version = "$major.$minor$prerelease" } else { $dir = "$major.$minor.$micro" } if ($prerelease) { if (($major -le 2) ` -or ($major -eq 3 -and $minor -eq 1) ` -or ($major -eq 3 -and $minor -eq 2) ` -or ($major -eq 3 -and $minor -eq 3) ` ) { $dir = "$dir/prev" } } if (($major -le 2) -or ($major -le 3 -and $minor -le 4)) { $ext = "msi" if ($platform_suffix) { $platform_suffix = ".$platform_suffix" } } else { $ext = "exe" if ($platform_suffix) { $platform_suffix = "-$platform_suffix" } } $filename = "python-$python_version$platform_suffix.$ext" $url = "$BASE_URL$dir/$filename" $filepath = Download $filename $url return $filepath } function InstallPython ($python_version, $architecture, $python_home) { Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home if (Test-Path $python_home) { Write-Host $python_home "already exists, skipping." return $false } if ($architecture -eq "32") { $platform_suffix = "" } else { $platform_suffix = "amd64" } $installer_path = DownloadPython $python_version $platform_suffix $installer_ext = [System.IO.Path]::GetExtension($installer_path) Write-Host "Installing $installer_path to $python_home" $install_log = $python_home + ".log" if ($installer_ext -eq '.msi') { InstallPythonMSI $installer_path $python_home $install_log } else { InstallPythonEXE $installer_path $python_home $install_log } if (Test-Path $python_home) { Write-Host "Python $python_version ($architecture) installation complete" } else { Write-Host "Failed to install Python in $python_home" Get-Content -Path $install_log Exit 1 } } function InstallPythonEXE ($exepath, $python_home, $install_log) { $install_args = "/quiet InstallAllUsers=1 TargetDir=$python_home" RunCommand $exepath $install_args } function InstallPythonMSI ($msipath, $python_home, $install_log) { $install_args = "/qn /log $install_log /i $msipath TARGETDIR=$python_home" $uninstall_args = "/qn /x $msipath" RunCommand "msiexec.exe" $install_args if (-not(Test-Path $python_home)) { Write-Host "Python seems to be installed else-where, reinstalling." RunCommand "msiexec.exe" $uninstall_args RunCommand "msiexec.exe" $install_args } } function RunCommand ($command, $command_args) { Write-Host $command $command_args Start-Process -FilePath $command -ArgumentList $command_args -Wait -Passthru } function InstallPip ($python_home) { $pip_path = $python_home + "\Scripts\pip.exe" $python_path = $python_home + "\python.exe" if (-not(Test-Path $pip_path)) { Write-Host "Installing pip..." 
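        # Note: on Python >= 2.7.9 / 3.4 pip could presumably be bootstrapped
        # with the bundled ensurepip module instead (python -m ensurepip);
        # downloading get-pip.py below keeps older interpreters working.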
$webclient = New-Object System.Net.WebClient $webclient.DownloadFile($GET_PIP_URL, $GET_PIP_PATH) Write-Host "Executing:" $python_path $GET_PIP_PATH & $python_path $GET_PIP_PATH } else { Write-Host "pip already installed." } } function DownloadMiniconda ($python_version, $platform_suffix) { if ($python_version -eq "3.4") { $filename = "Miniconda3-3.5.5-Windows-" + $platform_suffix + ".exe" } else { $filename = "Miniconda-3.5.5-Windows-" + $platform_suffix + ".exe" } $url = $MINICONDA_URL + $filename $filepath = Download $filename $url return $filepath } function InstallMiniconda ($python_version, $architecture, $python_home) { Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home if (Test-Path $python_home) { Write-Host $python_home "already exists, skipping." return $false } if ($architecture -eq "32") { $platform_suffix = "x86" } else { $platform_suffix = "x86_64" } $filepath = DownloadMiniconda $python_version $platform_suffix Write-Host "Installing" $filepath "to" $python_home $install_log = $python_home + ".log" $args = "/S /D=$python_home" Write-Host $filepath $args Start-Process -FilePath $filepath -ArgumentList $args -Wait -Passthru if (Test-Path $python_home) { Write-Host "Python $python_version ($architecture) installation complete" } else { Write-Host "Failed to install Python in $python_home" Get-Content -Path $install_log Exit 1 } } function InstallMinicondaPip ($python_home) { $pip_path = $python_home + "\Scripts\pip.exe" $conda_path = $python_home + "\Scripts\conda.exe" if (-not(Test-Path $pip_path)) { Write-Host "Installing pip..." $args = "install --yes pip" Write-Host $conda_path $args Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait -Passthru } else { Write-Host "pip already installed." } } function main () { InstallPython $env:PYTHON_VERSION $env:PYTHON_ARCH $env:PYTHON InstallPip $env:PYTHON } main scikit-learn-0.19.1/build_tools/appveyor/requirements.txt000066400000000000000000000011761317344356400236560ustar00rootroot00000000000000# Fetch numpy and scipy wheels from the sklearn rackspace wheelhouse. # Those wheels were collected from http://www.lfd.uci.edu/~gohlke/pythonlibs/ # This is a temporary solution. As soon as numpy and scipy provide official # wheel for windows we ca delete this --find-links line. --find-links http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/ # fix the versions of numpy to force the use of numpy and scipy to use the whl # of the rackspace folder instead of trying to install from more recent # source tarball published on PyPI numpy==1.9.3 scipy==0.16.0 cython nose nose-timer wheel wheelhouse_uploader scikit-learn-0.19.1/build_tools/appveyor/run_with_env.cmd000066400000000000000000000064461317344356400235730ustar00rootroot00000000000000:: To build extensions for 64 bit Python 3, we need to configure environment :: variables to use the MSVC 2010 C++ compilers from GRMSDKX_EN_DVD.iso of: :: MS Windows SDK for Windows 7 and .NET Framework 4 (SDK v7.1) :: :: To build extensions for 64 bit Python 2, we need to configure environment :: variables to use the MSVC 2008 C++ compilers from GRMSDKX_EN_DVD.iso of: :: MS Windows SDK for Windows 7 and .NET Framework 3.5 (SDK v7.0) :: :: 32 bit builds, and 64-bit builds for 3.5 and beyond, do not require specific :: environment configurations. 
:: :: Note: this script needs to be run with the /E:ON and /V:ON flags for the :: cmd interpreter, at least for (SDK v7.0) :: :: More details at: :: https://github.com/cython/cython/wiki/64BitCythonExtensionsOnWindows :: http://stackoverflow.com/a/13751649/163740 :: :: Author: Olivier Grisel :: License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/ :: :: Notes about batch files for Python people: :: :: Quotes in values are literally part of the values: :: SET FOO="bar" :: FOO is now five characters long: " b a r " :: If you don't want quotes, don't include them on the right-hand side. :: :: The CALL lines at the end of this file look redundant, but if you move them :: outside of the IF clauses, they do not run properly in the SET_SDK_64==Y :: case, I don't know why. @ECHO OFF SET COMMAND_TO_RUN=%* SET WIN_SDK_ROOT=C:\Program Files\Microsoft SDKs\Windows SET WIN_WDK=c:\Program Files (x86)\Windows Kits\10\Include\wdf :: Extract the major and minor versions, and allow for the minor version to be :: more than 9. This requires the version number to have two dots in it. SET MAJOR_PYTHON_VERSION=%PYTHON_VERSION:~0,1% IF "%PYTHON_VERSION:~3,1%" == "." ( SET MINOR_PYTHON_VERSION=%PYTHON_VERSION:~2,1% ) ELSE ( SET MINOR_PYTHON_VERSION=%PYTHON_VERSION:~2,2% ) :: Based on the Python version, determine what SDK version to use, and whether :: to set the SDK for 64-bit. IF %MAJOR_PYTHON_VERSION% == 2 ( SET WINDOWS_SDK_VERSION="v7.0" SET SET_SDK_64=Y ) ELSE ( IF %MAJOR_PYTHON_VERSION% == 3 ( SET WINDOWS_SDK_VERSION="v7.1" IF %MINOR_PYTHON_VERSION% LEQ 4 ( SET SET_SDK_64=Y ) ELSE ( SET SET_SDK_64=N IF EXIST "%WIN_WDK%" ( :: See: https://connect.microsoft.com/VisualStudio/feedback/details/1610302/ REN "%WIN_WDK%" 0wdf ) ) ) ELSE ( ECHO Unsupported Python version: "%MAJOR_PYTHON_VERSION%" EXIT 1 ) ) IF %PYTHON_ARCH% == 64 ( IF %SET_SDK_64% == Y ( ECHO Configuring Windows SDK %WINDOWS_SDK_VERSION% for Python %MAJOR_PYTHON_VERSION% on a 64 bit architecture SET DISTUTILS_USE_SDK=1 SET MSSdk=1 "%WIN_SDK_ROOT%\%WINDOWS_SDK_VERSION%\Setup\WindowsSdkVer.exe" -q -version:%WINDOWS_SDK_VERSION% "%WIN_SDK_ROOT%\%WINDOWS_SDK_VERSION%\Bin\SetEnv.cmd" /x64 /release ECHO Executing: %COMMAND_TO_RUN% call %COMMAND_TO_RUN% || EXIT 1 ) ELSE ( ECHO Using default MSVC build environment for 64 bit architecture ECHO Executing: %COMMAND_TO_RUN% call %COMMAND_TO_RUN% || EXIT 1 ) ) ELSE ( ECHO Using default MSVC build environment for 32 bit architecture ECHO Executing: %COMMAND_TO_RUN% call %COMMAND_TO_RUN% || EXIT 1 ) scikit-learn-0.19.1/build_tools/circle/000077500000000000000000000000001317344356400177615ustar00rootroot00000000000000scikit-learn-0.19.1/build_tools/circle/build_doc.sh000077500000000000000000000104671317344356400222540ustar00rootroot00000000000000#!/usr/bin/env bash set -x set -e # Decide what kind of documentation build to run, and run it. # # If the last commit message has a "[doc skip]" marker, do not build # the doc. On the contrary if a "[doc build]" marker is found, build the doc # instead of relying on the subsequent rules. # # We always build the documentation for jobs that are not related to a specific # PR (e.g. a merge to master or a maintenance branch). # # If this is a PR, do a full build if there are some files in this PR that are # under the "doc/" or "examples/" folders, otherwise perform a quick build. # # If the inspection of the current commit fails for any reason, the default # behavior is to quick build the documentation. 
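# For example, the markers are matched as plain substrings of the last commit
# message (see get_build_type below), so hypothetical commits like these
# would steer the build:
#   git commit --allow-empty -m "fix docstring typo [doc skip]"      # skip the doc build
#   git commit --allow-empty -m "rework gallery example [doc build]" # force a full build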
get_build_type() { if [ -z "$CIRCLE_SHA1" ] then echo SKIP: undefined CIRCLE_SHA1 return fi commit_msg=$(git log --format=%B -n 1 $CIRCLE_SHA1) if [ -z "$commit_msg" ] then echo QUICK BUILD: failed to inspect commit $CIRCLE_SHA1 return fi if [[ "$commit_msg" =~ \[doc\ skip\] ]] then echo SKIP: [doc skip] marker found return fi if [[ "$commit_msg" =~ \[doc\ quick\] ]] then echo QUICK: [doc quick] marker found return fi if [[ "$commit_msg" =~ \[doc\ build\] ]] then echo BUILD: [doc build] marker found return fi if [ -z "$CI_PULL_REQUEST" ] then echo BUILD: not a pull request return fi git_range="origin/master...$CIRCLE_SHA1" git fetch origin master >&2 || (echo QUICK BUILD: failed to get changed filenames for $git_range; return) filenames=$(git diff --name-only $git_range) if [ -z "$filenames" ] then echo QUICK BUILD: no changed filenames for $git_range return fi if echo "$filenames" | grep -q -e ^examples/ then echo BUILD: detected examples/ filename modified in $git_range: $(echo "$filenames" | grep -e ^examples/ | head -n1) return fi echo QUICK BUILD: no examples/ filename modified in $git_range: echo "$filenames" } build_type=$(get_build_type) if [[ "$build_type" =~ ^SKIP ]] then exit 0 fi if [[ "$CIRCLE_BRANCH" =~ ^master$|^[0-9]+\.[0-9]+\.X$ && -z "$CI_PULL_REQUEST" ]] then # PDF linked into HTML MAKE_TARGET="dist LATEXMKOPTS=-halt-on-error" elif [[ "$build_type" =~ ^QUICK ]] then MAKE_TARGET=html-noplot else MAKE_TARGET=html fi # Installing required system packages to support the rendering of math # notation in the HTML documentation sudo -E apt-get -yq update sudo -E apt-get -yq remove texlive-binaries --purge sudo -E apt-get -yq --no-install-suggests --no-install-recommends --force-yes \ install dvipng texlive-latex-base texlive-latex-extra \ texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended\ latexmk # deactivate circleci virtualenv and setup a miniconda env instead if [[ `type -t deactivate` ]]; then deactivate fi # Install dependencies with miniconda wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \ -O miniconda.sh chmod +x miniconda.sh && ./miniconda.sh -b -p $MINICONDA_PATH export PATH="$MINICONDA_PATH/bin:$PATH" # Temporary work-around (2017-09-27) # conda update --yes --quiet conda # Configure the conda environment and put it in the path using the # provided versions conda create -n $CONDA_ENV_NAME --yes --quiet python numpy scipy \ cython nose coverage matplotlib sphinx=1.6.2 pillow source activate testenv # Build and install scikit-learn in dev mode python setup.py develop # The pipefail is requested to propagate exit code set -o pipefail && cd doc && make $MAKE_TARGET 2>&1 | tee ~/log.txt cd - set +o pipefail affected_doc_paths() { files=$(git diff --name-only origin/master...$CIRCLE_SHA1) echo "$files" | grep ^doc/.*\.rst | sed 's/^doc\/\(.*\)\.rst$/\1.html/' echo "$files" | grep ^examples/.*.py | sed 's/^\(.*\)\.py$/auto_\1.html/' sklearn_files=$(echo "$files" | grep '^sklearn/') if [ -n "$sklearn_files" ] then grep -hlR -f<(echo "$sklearn_files" | sed 's/^/scikit-learn\/blob\/[a-z0-9]*\//') doc/_build/html/stable/modules/generated | cut -d/ -f5- fi } if [ -n "$CI_PULL_REQUEST" ] then echo "The following documentation files may have been changed by PR #$CI_PULL_REQUEST:" affected=$(affected_doc_paths) echo "$affected" | sed 's|^|* http://scikit-learn.org/circle?'$CIRCLE_BUILD_NUM'/|' ( echo '
<html><body><ul>' echo "$affected" | sed 's|.*|<li><a href="&">&</a></li>|' echo '</ul></body></html>
' ) > 'doc/_build/html/stable/_changed.html' fi scikit-learn-0.19.1/build_tools/circle/checkout_merge_commit.sh000077500000000000000000000016421317344356400246570ustar00rootroot00000000000000#!/bin/bash # Add `master` branch to the update list. # Otherwise CircleCI will give us a cached one. FETCH_REFS="+master:master" # Update PR refs for testing. if [[ -n "${CIRCLE_PR_NUMBER}" ]] then FETCH_REFS="${FETCH_REFS} +refs/pull/${CIRCLE_PR_NUMBER}/head:pr/${CIRCLE_PR_NUMBER}/head" FETCH_REFS="${FETCH_REFS} +refs/pull/${CIRCLE_PR_NUMBER}/merge:pr/${CIRCLE_PR_NUMBER}/merge" fi # Retrieve the refs. git fetch -u origin ${FETCH_REFS} # Checkout the PR merge ref. if [[ -n "${CIRCLE_PR_NUMBER}" ]] then git checkout -qf "pr/${CIRCLE_PR_NUMBER}/merge" || ( echo Could not fetch merge commit. >&2 echo There may be conflicts in merging PR \#${CIRCLE_PR_NUMBER} with master. >&2; exit 1) fi # Check for merge conflicts. if [[ -n "${CIRCLE_PR_NUMBER}" ]] then git branch --merged | grep master > /dev/null git branch --merged | grep "pr/${CIRCLE_PR_NUMBER}/head" > /dev/null fi scikit-learn-0.19.1/build_tools/circle/push_doc.sh000077500000000000000000000022371317344356400221300ustar00rootroot00000000000000#!/bin/bash # This script is meant to be called in the "deploy" step defined in # circle.yml. See https://circleci.com/docs/ for more details. # The behavior of the script is controlled by environment variable defined # in the circle.yml in the top level folder of the project. if [ -z $CIRCLE_PROJECT_USERNAME ]; then USERNAME="sklearn-ci"; else USERNAME=$CIRCLE_PROJECT_USERNAME; fi DOC_REPO="scikit-learn.github.io" if [ "$CIRCLE_BRANCH" = "master" ] then dir=dev else # Strip off .X dir="${CIRCLE_BRANCH::-2}" fi MSG="Pushing the docs to $dir/ for branch: $CIRCLE_BRANCH, commit $CIRCLE_SHA1" cd $HOME if [ ! -d $DOC_REPO ]; then git clone --depth 1 --no-checkout "git@github.com:scikit-learn/"$DOC_REPO".git"; fi cd $DOC_REPO git config core.sparseCheckout true echo $dir > .git/info/sparse-checkout git checkout $CIRCLE_BRANCH git reset --hard origin/$CIRCLE_BRANCH git rm -rf $dir/ && rm -rf $dir/ cp -R $HOME/scikit-learn/doc/_build/html/stable $dir git config --global user.email "olivier.grisel+sklearn-ci@gmail.com" git config --global user.name $USERNAME git config --global push.default matching git add -f $dir/ git commit -m "$MSG" $dir git push echo $MSG scikit-learn-0.19.1/build_tools/travis/000077500000000000000000000000001317344356400200305ustar00rootroot00000000000000scikit-learn-0.19.1/build_tools/travis/after_success.sh000077500000000000000000000012321317344356400232160ustar00rootroot00000000000000#!/bin/bash # This script is meant to be called by the "after_success" step defined in # .travis.yml. See http://docs.travis-ci.com/ for more details. # License: 3-clause BSD set -e if [[ "$COVERAGE" == "true" ]]; then # Need to run codecov from a git checkout, so we copy .coverage # from TEST_DIR where nosetests has been run cp $TEST_DIR/.coverage $TRAVIS_BUILD_DIR cd $TRAVIS_BUILD_DIR # Ignore codecov failures as the codecov server is not # very reliable but we don't want travis to report a failure # in the github UI just because the coverage report failed to # be published. codecov || echo "codecov upload failed" fi scikit-learn-0.19.1/build_tools/travis/flake8_diff.sh000077500000000000000000000135311317344356400225340ustar00rootroot00000000000000#!/bin/bash # This script is used in Travis to check that PRs do not add obvious # flake8 violations. 
It relies on two things: # - find common ancestor between branch and # scikit-learn/scikit-learn remote # - run flake8 --diff on the diff between the branch and the common # ancestor # # Additional features: # - the line numbers in Travis match the local branch on the PR # author machine. # - ./build_tools/travis/flake8_diff.sh can be run locally for quick # turn-around set -e # pipefail is necessary to propagate exit codes set -o pipefail PROJECT=scikit-learn/scikit-learn PROJECT_URL=https://github.com/$PROJECT.git # Find the remote with the project name (upstream in most cases) REMOTE=$(git remote -v | grep $PROJECT | cut -f1 | head -1 || echo '') # Add a temporary remote if needed. For example this is necessary when # Travis is configured to run in a fork. In this case 'origin' is the # fork and not the reference repo we want to diff against. if [[ -z "$REMOTE" ]]; then TMP_REMOTE=tmp_reference_upstream REMOTE=$TMP_REMOTE git remote add $REMOTE $PROJECT_URL fi echo "Remotes:" echo '--------------------------------------------------------------------------------' git remote --verbose # Travis does the git clone with a limited depth (50 at the time of # writing). This may not be enough to find the common ancestor with # $REMOTE/master so we unshallow the git checkout if [[ -a .git/shallow ]]; then echo -e '\nTrying to unshallow the repo:' echo '--------------------------------------------------------------------------------' git fetch --unshallow fi if [[ "$TRAVIS" == "true" ]]; then if [[ "$TRAVIS_PULL_REQUEST" == "false" ]] then # In main repo, using TRAVIS_COMMIT_RANGE to test the commits # that were pushed into a branch if [[ "$PROJECT" == "$TRAVIS_REPO_SLUG" ]]; then if [[ -z "$TRAVIS_COMMIT_RANGE" ]]; then echo "New branch, no commit range from Travis so passing this test by convention" exit 0 fi COMMIT_RANGE=$TRAVIS_COMMIT_RANGE fi else # We want to fetch the code as it is in the PR branch and not # the result of the merge into master. This way line numbers # reported by Travis will match with the local code. 
LOCAL_BRANCH_REF=travis_pr_$TRAVIS_PULL_REQUEST # In Travis the PR target is always origin git fetch origin pull/$TRAVIS_PULL_REQUEST/head:refs/$LOCAL_BRANCH_REF fi fi # If not using the commit range from Travis we need to find the common # ancestor between $LOCAL_BRANCH_REF and $REMOTE/master if [[ -z "$COMMIT_RANGE" ]]; then if [[ -z "$LOCAL_BRANCH_REF" ]]; then LOCAL_BRANCH_REF=$(git rev-parse --abbrev-ref HEAD) fi echo -e "\nLast 2 commits in $LOCAL_BRANCH_REF:" echo '--------------------------------------------------------------------------------' git --no-pager log -2 $LOCAL_BRANCH_REF REMOTE_MASTER_REF="$REMOTE/master" # Make sure that $REMOTE_MASTER_REF is a valid reference echo -e "\nFetching $REMOTE_MASTER_REF" echo '--------------------------------------------------------------------------------' git fetch $REMOTE master:refs/remotes/$REMOTE_MASTER_REF LOCAL_BRANCH_SHORT_HASH=$(git rev-parse --short $LOCAL_BRANCH_REF) REMOTE_MASTER_SHORT_HASH=$(git rev-parse --short $REMOTE_MASTER_REF) COMMIT=$(git merge-base $LOCAL_BRANCH_REF $REMOTE_MASTER_REF) || \ echo "No common ancestor found for $(git show $LOCAL_BRANCH_REF -q) and $(git show $REMOTE_MASTER_REF -q)" if [ -z "$COMMIT" ]; then exit 1 fi COMMIT_SHORT_HASH=$(git rev-parse --short $COMMIT) echo -e "\nCommon ancestor between $LOCAL_BRANCH_REF ($LOCAL_BRANCH_SHORT_HASH)"\ "and $REMOTE_MASTER_REF ($REMOTE_MASTER_SHORT_HASH) is $COMMIT_SHORT_HASH:" echo '--------------------------------------------------------------------------------' git --no-pager show --no-patch $COMMIT_SHORT_HASH COMMIT_RANGE="$COMMIT_SHORT_HASH..$LOCAL_BRANCH_SHORT_HASH" if [[ -n "$TMP_REMOTE" ]]; then git remote remove $TMP_REMOTE fi else echo "Got the commit range from Travis: $COMMIT_RANGE" fi echo -e '\nRunning flake8 on the diff in the range' "$COMMIT_RANGE" \ "($(git rev-list $COMMIT_RANGE | wc -l) commit(s)):" echo '--------------------------------------------------------------------------------' # We ignore files from sklearn/externals. Unfortunately there is no # way to do it with flake8 directly (the --exclude does not seem to # work with --diff). We could use the exclude magic in the git pathspec # ':!sklearn/externals' but it is only available on git 1.9 and Travis # uses git 1.8. 
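# For reference, on git >= 1.9 the same filtering could be written with an
# exclude pathspec (untested here, since Travis ships git 1.8):
#   git diff --name-only $COMMIT_RANGE -- '.' ':!sklearn/externals'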
# We need the following command to exit with 0 hence the echo in case # there is no match MODIFIED_FILES="$(git diff --name-only $COMMIT_RANGE | grep -v 'sklearn/externals' | \ grep -v 'doc/sphinxext/sphinx_gallery' || echo "no_match")" check_files() { files="$1" shift options="$*" if [ -n "$files" ]; then # Conservative approach: diff without context (--unified=0) so that code # that was not changed does not create failures git diff --unified=0 $COMMIT_RANGE -- $files | flake8 --diff --show-source $options fi } if [[ "$MODIFIED_FILES" == "no_match" ]]; then echo "No file outside sklearn/externals and doc/sphinxext/sphinx_gallery has been modified" else # Default ignore PEP8 violations are from flake8 3.3.0 DEFAULT_IGNORED_PEP8=E121,E123,E126,E226,E24,E704,W503,W504 check_files "$(echo "$MODIFIED_FILES" | grep -v ^examples)" \ --ignore $DEFAULT_IGNORED_PEP8 # Examples are allowed to not have imports at top of file check_files "$(echo "$MODIFIED_FILES" | grep ^examples)" \ --ignore $DEFAULT_IGNORED_PEP8 --ignore E402 fi echo -e "No problem detected by flake8\n" scikit-learn-0.19.1/build_tools/travis/install.sh000077500000000000000000000102531317344356400220360ustar00rootroot00000000000000#!/bin/bash # This script is meant to be called by the "install" step defined in # .travis.yml. See http://docs.travis-ci.com/ for more details. # The behavior of the script is controlled by environment variables defined # in the .travis.yml in the top level folder of the project. # License: 3-clause BSD # Travis clones the scikit-learn/scikit-learn repository into a local repository. # We use a cached directory with three scikit-learn repositories (one for each # matrix entry) from which we pull from the local Travis repository. This allows # us to keep build artefacts for gcc + cython, and gain time set -e echo 'List files from cached directories' echo 'pip:' ls $HOME/.cache/pip export CC=/usr/lib/ccache/gcc export CXX=/usr/lib/ccache/g++ # Useful for debugging how ccache is used # export CCACHE_LOGFILE=/tmp/ccache.log # ~60M is used by .ccache when compiling from scratch at the time of writing ccache --max-size 100M --show-stats if [[ "$DISTRIB" == "conda" ]]; then # Deactivate the travis-provided virtual environment and setup a # conda-based environment instead deactivate # Install miniconda wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh \ -O miniconda.sh MINICONDA_PATH=/home/travis/miniconda chmod +x miniconda.sh && ./miniconda.sh -b -p $MINICONDA_PATH export PATH=$MINICONDA_PATH/bin:$PATH # Temporary work-around (2017-09-27) # conda update --yes conda # Configure the conda environment and put it in the path using the # provided versions if [[ "$INSTALL_MKL" == "true" ]]; then conda create -n testenv --yes python=$PYTHON_VERSION pip nose pytest \ numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION \ mkl cython=$CYTHON_VERSION \ ${PANDAS_VERSION+pandas=$PANDAS_VERSION} else conda create -n testenv --yes python=$PYTHON_VERSION pip nose pytest \ numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION \ nomkl cython=$CYTHON_VERSION \ ${PANDAS_VERSION+pandas=$PANDAS_VERSION} fi source activate testenv # Install nose-timer via pip pip install nose-timer elif [[ "$DISTRIB" == "ubuntu" ]]; then # At the time of writing numpy 1.9.1 is included in the travis # virtualenv but we want to use the numpy installed through apt-get # install.
deactivate # Create a new virtualenv using system site packages for python, numpy # and scipy virtualenv --system-site-packages testvenv source testvenv/bin/activate pip install nose nose-timer cython elif [[ "$DISTRIB" == "scipy-dev-wheels" ]]; then # Set up our own virtualenv environment to avoid travis' numpy. # This venv points to the python interpreter of the travis build # matrix. virtualenv --python=python ~/testvenv source ~/testvenv/bin/activate pip install --upgrade pip setuptools echo "Installing numpy and scipy master wheels" dev_url=https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com pip install --pre --upgrade --timeout=60 -f $dev_url numpy scipy pip install nose nose-timer cython fi if [[ "$COVERAGE" == "true" ]]; then pip install coverage codecov fi if [[ "$TEST_DOCSTRINGS" == "true" ]]; then pip install sphinx numpydoc # numpydoc requires sphinx fi if [[ "$SKIP_TESTS" == "true" ]]; then echo "No need to build scikit-learn when not running the tests" else # Build scikit-learn in the install.sh script to collapse the verbose # build output in the travis output when it succeeds. python --version python -c "import numpy; print('numpy %s' % numpy.__version__)" python -c "import scipy; print('scipy %s' % scipy.__version__)" python -c "\ try: import pandas print('pandas %s' % pandas.__version__) except ImportError: pass " python setup.py develop ccache --show-stats # Useful for debugging how ccache is used # cat $CCACHE_LOGFILE fi if [[ "$RUN_FLAKE8" == "true" ]]; then # flake8 version is temporarily set to 2.5.1 because the next # version available on conda (3.3.0) has a bug that checks non-python # files and causes non-meaningful flake8 errors conda install --yes flake8=2.5.1 fi scikit-learn-0.19.1/build_tools/travis/test_script.sh000077500000000000000000000034371317344356400227370ustar00rootroot00000000000000#!/bin/bash # This script is meant to be called by the "script" step defined in # .travis.yml. See http://docs.travis-ci.com/ for more details. # The behavior of the script is controlled by environment variables defined # in the .travis.yml in the top level folder of the project. # License: 3-clause BSD set -e python --version python -c "import numpy; print('numpy %s' % numpy.__version__)" python -c "import scipy; print('scipy %s' % scipy.__version__)" python -c "\ try: import pandas print('pandas %s' % pandas.__version__) except ImportError: pass " python -c "import multiprocessing as mp; print('%d CPUs' % mp.cpu_count())" run_tests() { if [[ "$USE_PYTEST" == "true" ]]; then TEST_CMD="pytest --showlocals --durations=1 --pyargs" else TEST_CMD="nosetests --with-timer --timer-top-n 20" fi # Get into a temp directory to run tests from the installed scikit-learn and # check if we do not leave artifacts mkdir -p $TEST_DIR # We need the setup.cfg for the nose settings cp setup.cfg $TEST_DIR cd $TEST_DIR # Skip tests that require large downloads over the network to save bandwidth # usage as travis workers are stateless and therefore traditional local # disk caching does not work.
export SKLEARN_SKIP_NETWORK_TESTS=1 if [[ "$COVERAGE" == "true" ]]; then TEST_CMD="$TEST_CMD --with-coverage" fi $TEST_CMD sklearn # Going back to git checkout folder needed to test documentation cd $OLDPWD if [[ "$USE_PYTEST" == "true" ]]; then pytest $(find doc -name '*.rst' | sort) else # Makefile is using nose make test-doc fi } if [[ "$RUN_FLAKE8" == "true" ]]; then source build_tools/travis/flake8_diff.sh fi if [[ "$SKIP_TESTS" != "true" ]]; then run_tests fi scikit-learn-0.19.1/build_tools/windows/000077500000000000000000000000001317344356400202125ustar00rootroot00000000000000scikit-learn-0.19.1/build_tools/windows/windows_testing_downloader.ps1000066400000000000000000000231671317344356400263150ustar00rootroot00000000000000# Author: Kyle Kastner # License: BSD 3 clause # This script is a helper to download the base python, numpy, and scipy # packages from their respective websites. # To quickly execute the script, run the following Powershell command: # powershell.exe -ExecutionPolicy unrestricted "iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/continuous_integration/windows/windows_testing_downloader.ps1'))" # This is a stopgap solution to make Windows testing easier # until Windows CI issues are resolved. # Rackspace's default Windows VMs have several security features enabled by default. # The DisableInternetExplorerESC function disables a feature which # prevents any webpage from opening without explicit permission. # This is a default setting of Windows VMs on Rackspace, and makes it annoying to # download other packages to test! # Powershell scripts are also disabled by default. One must run the command: # set-executionpolicy unrestricted # from a Powershell terminal with administrator rights to enable scripts. # To start an administrator Powershell terminal, right click second icon from the left on Windows Server 2012's bottom taskbar. param ( [string]$python = "None", [string]$nogit = "False" ) function DisableInternetExplorerESC { # Disables InternetExplorerESC to enable easier manual downloads of testing packages. # http://stackoverflow.com/questions/9368305/disable-ie-security-on-windows-server-via-powershell $AdminKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A7-37EF-4b3f-8CFC-4F3A74704073}" $UserKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A8-37EF-4b3f-8CFC-4F3A74704073}" Set-ItemProperty -Path $AdminKey -Name "IsInstalled" -Value 0 Set-ItemProperty -Path $UserKey -Name "IsInstalled" -Value 0 Stop-Process -Name Explorer Write-Host "IE Enhanced Security Configuration (ESC) has been disabled." -ForegroundColor Green } function DownloadPackages ($package_dict, $append_string) { $webclient = New-Object System.Net.WebClient ForEach ($key in $package_dict.Keys) { $url = $package_dict[$key] $file = $key + $append_string if ($url -match "(\.*exe$)") { $file = $file + ".exe" } elseif ($url -match "(\.*msi$)") { $file = $file + ".msi" } else { $file = $file + ".py" } $basedir = $pwd.Path + "\" $filepath = $basedir + $file Write-Host "Downloading" $file "from" $url # Retry up to 5 times in case of network transient errors. 
$retry_attempts = 5 for($i=0; $i -lt $retry_attempts; $i++){ try{ $webclient.DownloadFile($url, $filepath) break } Catch [Exception]{ Start-Sleep 1 } } Write-Host "File saved at" $filepath } } function InstallPython($match_string) { $pkg_regex = "python" + $match_string + "*" $pkg = Get-ChildItem -Filter $pkg_regex -Name Invoke-Expression -Command "msiexec /qn /i $pkg" Write-Host "Installing Python" Start-Sleep 25 Write-Host "Python installation complete" } function InstallPip($match_string, $python_version) { $pkg_regex = "get-pip" + $match_string + "*" $py = $python_version -replace "\." $pkg = Get-ChildItem -Filter $pkg_regex -Name $python_path = "C:\Python" + $py + "\python.exe" Invoke-Expression -Command "$python_path $pkg" } function EnsurePip($python_version) { $py = $python_version -replace "\." $python_path = "C:\Python" + $py + "\python.exe" Invoke-Expression -Command "$python_path -m ensurepip" } function GetPythonHome($python_version) { $py = $python_version -replace "\." $pypath = "C:\Python" + $py + "\" return $pypath } function GetPipPath($python_version) { $py = $python_version -replace "\." $pypath = GetPythonHome $python_version if ($py.StartsWith("3")) { $pip = $pypath + "Scripts\pip3.exe" } else { $pip = $pypath + "Scripts\pip.exe" } return $pip } function PipInstall($pkg_name, $python_version, $extra_args) { $pip = GetPipPath $python_version Invoke-Expression -Command "$pip install $pkg_name" } function InstallNose($python_version) { PipInstall "nose" $python_version } function WheelInstall($name, $url, $python_version) { $pip = GetPipPath $python_version $args = "install --use-wheel --no-index" Invoke-Expression -Command "$pip $args $url $name" } function InstallWheel($python_version) { PipInstall "virtualenv" $python_version PipInstall "wheel" $python_version } function InstallNumpy($package_dict, $python_version) { #Don't pass name so we can use URL directly. WheelInstall "" $package_dict["numpy"] $python_version } function InstallScipy($package_dict, $python_version) { #Don't pass name so we can use URL directly. WheelInstall "" $package_dict["scipy"] $python_version } function InstallGit { $pkg_regex = "git*" $pkg = Get-ChildItem -Filter $pkg_regex -Name $pkg_cmd = $pwd.ToString() + "\" + $pkg + " /verysilent" Invoke-Expression -Command $pkg_cmd Write-Host "Installing Git" Start-Sleep 20 # Remove the installer - seems to cause weird issues with Git Bash Invoke-Expression -Command "rm git.exe" Write-Host "Git installation complete" } function ReadAndUpdateFromRegistry { # http://stackoverflow.com/questions/14381650/how-to-update-windows-powershell-session-environment-variables-from-registry foreach($level in "Machine","User") { [Environment]::GetEnvironmentVariables($level).GetEnumerator() | % { # For Path variables, append the new values, if they're not already in there if($_.Name -match 'Path$') { $_.Value = ($((Get-Content "Env:$($_.Name)") + ";$($_.Value)") -split ';' | Select -unique) -join ';' } $_ } | Set-Content -Path { "Env:$($_.Name)" } } } function UpdatePaths($python_version) { #This function makes local path updates required in order to install Python and supplementary packages in a single shell. 
$pypath = GetPythonHome $python_version $env:PATH = $env:PATH + ";" + $pypath $env:PYTHONPATH = $pypath + "DLLs;" + $pypath + "Lib;" + $pypath + "Lib\site-packages" $env:PYTHONHOME = $pypath Write-Host "PYTHONHOME temporarily set to" $env:PYTHONHOME Write-Host "PYTHONPATH temporarily set to" $env:PYTHONPATH Write-Host "PATH temporarily set to" $env:PATH } function Python27URLs { # Function returns a dictionary of packages to download for Python 2.7. $urls = @{ "python" = "https://www.python.org/ftp/python/2.7.7/python-2.7.7.msi" "numpy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/numpy-1.8.1-cp27-none-win32.whl" "scipy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/scipy-0.14.0-cp27-none-win32.whl" "get-pip" = "https://bootstrap.pypa.io/get-pip.py" } return $urls } function Python34URLs { # Function returns a dictionary of packages to download for Python 3.4. $urls = @{ "python" = "https://www.python.org/ftp/python/3.4.1/python-3.4.1.msi" "numpy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/numpy-1.8.1-cp34-none-win32.whl" "scipy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/scipy-0.14.0-cp34-none-win32.whl" } return $urls } function GitURLs { # Function returns a dictionary of packages to download for Git $urls = @{ "git" = "https://github.com/msysgit/msysgit/releases/download/Git-1.9.4-preview20140611/Git-1.9.4-preview20140611.exe" } return $urls } function main { $versions = @{ "2.7" = Python27URLs "3.4" = Python34URLs } if ($nogit -eq "False") { Write-Host "Downloading and installing Gitbash" $urls = GitURLs DownloadPackages $urls "" InstallGit ".exe" } if (($python -eq "None")) { Write-Host "Installing all supported python versions" Write-Host "Current versions supported are:" ForEach ($key in $versions.Keys) { Write-Host $key $all_python += @($key) } } elseif(!($versions.ContainsKey($python))) { Write-Host "Python version not recognized!" Write-Host "Pass python version with -python" Write-Host "Current versions supported are:" ForEach ($key in $versions.Keys) { Write-Host $key } return } else { $all_python += @($python) } ForEach ($py in $all_python) { Write-Host "Installing Python" $py DisableInternetExplorerESC $pystring = $py -replace "\." $pystring = "_py" + $pystring $package_dict = $versions[$py] # This will download the whl packages as well which is # clunky but makes configuration simpler. DownloadPackages $package_dict $pystring UpdatePaths $py InstallPython $pystring ReadAndUpdateFromRegistry if ($package_dict.ContainsKey("get-pip")) { InstallPip $pystring $py } else { EnsurePip $py } InstallNose $py InstallWheel $py # The installers below here use wheel packages. # Wheels were created from CGohlke's installers with # wheel convert # These are hosted in Rackspace Cloud Files. InstallNumpy $package_dict $py InstallScipy $package_dict $py } return } main scikit-learn-0.19.1/circle.yml000066400000000000000000000014341317344356400161670ustar00rootroot00000000000000checkout: post: - ./build_tools/circle/checkout_merge_commit.sh machine: environment: MINICONDA_PATH: $HOME/miniconda CONDA_ENV_NAME: testenv dependencies: cache_directories: - "~/scikit_learn_data" - "~/scikit-learn.github.io" # Check whether the doc build is required, install build dependencies and # run sphinx to build the doc. 
override: - ./build_tools/circle/build_doc.sh: timeout: 3600 # seconds test: override: - | export PATH="$MINICONDA_PATH/bin:$PATH" source activate $CONDA_ENV_NAME make test-sphinxext deployment: push: branch: /^master$|^[0-9]+\.[0-9]+\.X$/ commands: - bash build_tools/circle/push_doc.sh general: # Open the doc to the API artifacts: - "doc/_build/html" - "~/log.txt" scikit-learn-0.19.1/conftest.py000066400000000000000000000000001317344356400163660ustar00rootroot00000000000000scikit-learn-0.19.1/doc/000077500000000000000000000000001317344356400147465ustar00rootroot00000000000000scikit-learn-0.19.1/doc/Makefile000066400000000000000000000070661317344356400164170ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD ?= sphinx-build PAPER = BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml pickle json latex latexpdf changes linkcheck doctest optipng all: html-noplot help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " changes to make an overview of all changed/added/deprecated items" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: -rm -rf $(BUILDDIR)/* -rm -rf auto_examples/ -rm -rf generated/* -rm -rf modules/generated/* html: # These two lines make the build a bit more lengthy, and the # the embedding of images more robust rm -rf $(BUILDDIR)/html/_images #rm -rf _build/doctrees/ $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html/stable @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable" html-noplot: $(SPHINXBUILD) -D plot_gallery=0 -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html/stable @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." make -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." 
doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." download-data: python -c "from sklearn.datasets.lfw import check_fetch_lfw; check_fetch_lfw()" # Optimize PNG files. Needs OptiPNG. Change the -P argument to the number of # cores you have available, so -P 64 if you have a real computer ;) optipng: find _build auto_examples */generated -name '*.png' -print0 \ | xargs -0 -n 1 -P 4 optipng -o10 dist: html latexpdf cp _build/latex/user_guide.pdf _build/html/stable/_downloads/scikit-learn-docs.pdf scikit-learn-0.19.1/doc/README.md000066400000000000000000000020651317344356400162300ustar00rootroot00000000000000# Documentation for scikit-learn This section contains the full manual and web page as displayed in http://scikit-learn.org. To generate the full web page, including the example gallery (this might take a while): make html Or, if you'd rather not build the example gallery: make html-noplot That should create all the docs in the directory _build/html To build the PDF manual, run make latexpdf The website is hosted at GitHub and can be updated manually (for releases) by pushing to the https://github.com/scikit-learn/scikit-learn.github.io repository. It's recommended to run OptiPNG before uploading the website. The PNG files generated by Matplotlib tend to be ~20% too big, and they're costing us bandwidth. You can run OptiPNG with:: make optipng # Development documentation automated build A Rackspace cloud server named 'docbuilder' is continuously building the master branch to update the http://scikit-learn.org/dev tree of the website. The configuration of this server is managed at: https://github.com/scikit-learn/sklearn-docbuilder scikit-learn-0.19.1/doc/about.rst000066400000000000000000000217541317344356400166210ustar00rootroot00000000000000About us ======== .. include:: ../AUTHORS.rst .. seealso:: :ref:`How you can contribute to the project ` .. _citing-scikit-learn: Citing scikit-learn ------------------- If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper: `Scikit-learn: Machine Learning in Python `_, Pedregosa *et al.*, JMLR 12, pp. 2825-2830, 2011. Bibtex entry:: @article{scikit-learn, title={Scikit-learn: Machine Learning in {P}ython}, author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, journal={Journal of Machine Learning Research}, volume={12}, pages={2825--2830}, year={2011} } If you want to cite scikit-learn for its API or design, you may also want to consider the following paper: `API design for machine learning software: experiences from the scikit-learn project `_, Buitinck *et al.*, 2013.
Bibtex entry:: @inproceedings{sklearn_api, author = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and Fabian Pedregosa and Andreas Mueller and Olivier Grisel and Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort and Jaques Grobler and Robert Layton and Jake VanderPlas and Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux}, title = {{API} design for machine learning software: experiences from the scikit-learn project}, booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning}, year = {2013}, pages = {108--122}, } Artwork ------- High quality PNG and SVG logos are available in the `doc/logos/ `_ source directory. .. image:: images/scikit-learn-logo-notext.png :align: center Funding ------- `INRIA `_ actively supports this project. It has provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler (2012-2013) and Olivier Grisel (2013-2017) to work on this project full-time. It also hosts coding sprints and other events. .. image:: images/inria-logo.jpg :width: 200pt :align: center :target: https://www.inria.fr `Paris-Saclay Center for Data Science `_ funded one year for a developer to work on the project full-time (2014-2015) and 50% of the time of Guillaume Lemaitre (2016-2017). .. image:: images/cds-logo.png :width: 200pt :align: center :target: http://www.datascience-paris-saclay.fr `NYU Moore-Sloan Data Science Environment `_ funded Andreas Mueller (2014-2016) to work on this project. The Moore-Sloan Data Science Environment also funds several students to work on the project part-time. .. image:: images/nyu_short_color.png :width: 200pt :align: center :target: http://cds.nyu.edu/mooresloan/ `Télécom Paristech `_ funded Manoj Kumar (2014), Tom Dupré la Tour (2015), Raghav RV (2015-2017), Thierry Guillemot (2016-2017) and Albert Thomas (2017) to work on scikit-learn. .. image:: themes/scikit-learn/static/img/telecom.png :width: 100pt :align: center :target: http://www.telecom-paristech.fr/ `Columbia University `_ funds Andreas Müller since 2016. .. image:: themes/scikit-learn/static/img/columbia.png :width: 100pt :align: center :target: http://www.columbia.edu/ Andreas Müller also received a grant to improve scikit-learn from the `Alfred P. Sloan Foundation `_ in 2017. .. image:: images/sloan_banner.png :width: 200pt :align: center :target: https://sloan.org/ `The University of Sydney `_ funds Joel Nothman since July 2017. .. image:: themes/scikit-learn/static/img/sydney-primary.jpeg :width: 200pt :align: center :target: http://www.sydney.edu.au/ The following students were sponsored by `Google `_ to work on scikit-learn through the `Google Summer of Code `_ program. - 2007 - David Cournapeau - 2011 - `Vlad Niculae`_ - 2012 - `Vlad Niculae`_, Immanuel Bayer. - 2013 - Kemal Eren, Nicolas Trésegnie - 2014 - Hamzeh Alsalhi, Issam Laradji, Maheshakya Wijewardena, Manoj Kumar. - 2015 - `Raghav RV `_, Wei Xue - 2016 - `Nelson Liu `_, `YenChen Lin `_ It also provided funding for sprints and events around scikit-learn. If you would like to participate in the next Google Summer of code program, please see `this page `_. The `NeuroDebian `_ project providing `Debian `_ packaging and contributions is supported by `Dr. James V. Haxby `_ (`Dartmouth College `_). The `PSF `_ helped find and manage funding for our 2011 Granada sprint. More information can be found `here `__ `tinyclues `_ funded the 2011 international Granada sprint. 
Donating to the project ~~~~~~~~~~~~~~~~~~~~~~~ If you are interested in donating to the project or to one of our code-sprints, you can use the *Paypal* button below or the `NumFOCUS Donations Page `_ (if you use the latter, please indicate that you are donating for the scikit-learn project). All donations will be handled by `NumFOCUS `_, a non-profit organization managed by a board of `Scipy community members `_. NumFOCUS's mission is to foster scientific computing software, in particular in Python. As a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available while in compliance with tax regulations. Donations received for the scikit-learn project will mostly go towards covering travel expenses for code sprints, as well as towards the organization budget of the project [#f1]_. .. raw :: html


.. rubric:: Notes .. [#f1] Regarding the organization budget in particular, we might use some of the donated funds to pay for other project expenses such as DNS, hosting or continuous integration services. The 2013 Paris international sprint ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |center-div| |telecom| |tinyclues| |afpy| |FNRS| |end-div| .. |center-div| raw:: html
<div style="text-align: center;"> .. |telecom| image:: themes/scikit-learn/static/img/telecom.png :width: 120pt :target: http://www.telecom-paristech.fr/ .. |tinyclues| image:: https://www.tinyclues.com/web/wp-content/uploads/2016/06/Tinyclues-PNG-logo.png :width: 120pt :target: https://www.tinyclues.com/ .. |afpy| image:: https://www.afpy.org/logo.png :width: 120pt :target: https://www.afpy.org .. |SGR| image:: http://www.svi.cnrs-bellevue.fr/wikimedia/images/Logo_svi_inp.png :width: 120pt :target: http://www.svi.cnrs-bellevue.fr .. |FNRS| image:: http://www.fnrs.be/en/images/FRS-FNRS_rose_transp.png :width: 120pt :target: http://www.frs-fnrs.be/ .. figure:: images/dysco.png :width: 120pt :target: http://sites.uclouvain.be/dysco/ IAP VII/19 - DYSCO .. |end-div| raw:: html </div>
*For more information on this sprint, see* `here `__ Infrastructure support ---------------------- - We would like to thank `Rackspace `_ for providing us with a free `Rackspace Cloud `_ account to automatically build the documentation and the example gallery from for the development version of scikit-learn using `this tool `_. - We would also like to thank `Shining Panda `_ for free CPU time on their Continuous Integration server. scikit-learn-0.19.1/doc/conf.py000066400000000000000000000231171317344356400162510ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # scikit-learn documentation build configuration file, created by # sphinx-quickstart on Fri Jan 8 09:13:42 2010. # # This file is execfile()d with the current directory set to its containing # dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. from __future__ import print_function import sys import os from sklearn.externals.six import u # If extensions (or modules to document with autodoc) are in another # directory, add these directories to sys.path here. If the directory # is relative to the documentation root, use os.path.abspath to make it # absolute, like shown here. sys.path.insert(0, os.path.abspath('sphinxext')) from github_link import make_linkcode_resolve import sphinx_gallery # -- General configuration --------------------------------------------------- # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = [ 'sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'numpy_ext.numpydoc', 'sphinx.ext.linkcode', 'sphinx.ext.doctest', 'sphinx_gallery.gen_gallery', 'sphinx_issues', ] # pngmath / imgmath compatibility layer for different sphinx versions import sphinx from distutils.version import LooseVersion if LooseVersion(sphinx.__version__) < LooseVersion('1.4'): extensions.append('sphinx.ext.pngmath') else: extensions.append('sphinx.ext.imgmath') autodoc_default_flags = ['members', 'inherited-members'] # Add any paths that contain templates here, relative to this directory. templates_path = ['templates'] # generate autosummary even if no references autosummary_generate = True # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8' # Generate the plots for the gallery plot_gallery = True # The master toctree document. master_doc = 'index' # General information about the project. project = u('scikit-learn') copyright = u('2007 - 2017, scikit-learn developers (BSD License)') # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. import sklearn version = sklearn.__version__ # The full version, including alpha/beta/rc tags. release = sklearn.__version__ # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of documents that shouldn't be included in the build. #unused_docs = [] # List of directories, relative to source directory, that shouldn't be # searched for source files. 
exclude_trees = ['_build', 'templates', 'includes'] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. add_function_parentheses = False # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # -- Options for HTML output ------------------------------------------------- # The theme to use for HTML and HTML Help pages. Major themes that come with # Sphinx are currently 'default' and 'sphinxdoc'. html_theme = 'scikit-learn' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. html_theme_options = {'oldversion': False, 'collapsiblesidebar': True, 'google_analytics': True, 'surveybanner': False, 'sprintbanner': True} # Add any paths that contain custom themes here, relative to this directory. html_theme_path = ['themes'] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. html_short_title = 'scikit-learn' # The name of an image file (relative to this directory) to place at the top # of the sidebar. html_logo = 'logos/scikit-learn-logo-small.png' # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. html_favicon = 'logos/favicon.ico' # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['images'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. html_domain_indices = False # If false, no index is generated. html_use_index = False # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = '' # Output file base name for HTML help builder. htmlhelp_basename = 'scikit-learndoc' # -- Options for LaTeX output ------------------------------------------------ # The paper size ('letter' or 'a4'). 
#latex_paper_size = 'letter' # The font size ('10pt', '11pt' or '12pt'). #latex_font_size = '10pt' # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, documentclass # [howto/manual]). latex_documents = [('index', 'user_guide.tex', u('scikit-learn user guide'), u('scikit-learn developers'), 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. latex_logo = "logos/scikit-learn-logo.png" # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # Additional stuff for the LaTeX preamble. latex_preamble = r""" \usepackage{amsmath}\usepackage{amsfonts}\usepackage{bm}\usepackage{morefloats} \usepackage{enumitem} \setlistdepth{10} """ # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. latex_domain_indices = False trim_doctests_flags = True sphinx_gallery_conf = { 'doc_module': 'sklearn', 'backreferences_dir': os.path.join('modules', 'generated'), 'reference_url': { 'sklearn': None, 'matplotlib': 'http://matplotlib.org', 'numpy': 'http://docs.scipy.org/doc/numpy-1.8.1', 'scipy': 'http://docs.scipy.org/doc/scipy-0.13.3/reference'} } # The following dictionary contains the information used to create the # thumbnails for the front page of the scikit-learn home page. # key: first image in set # values: (number of plot in set, height of thumbnail) carousel_thumbs = {'sphx_glr_plot_classifier_comparison_001.png': 600, 'sphx_glr_plot_outlier_detection_003.png': 372, 'sphx_glr_plot_gpr_co2_001.png': 350, 'sphx_glr_plot_adaboost_twoclass_001.png': 372, 'sphx_glr_plot_compare_methods_001.png': 349} def make_carousel_thumbs(app, exception): """produces the final resized carousel images""" if exception is not None: return print('Preparing carousel images') image_dir = os.path.join(app.builder.outdir, '_images') for glr_plot, max_width in carousel_thumbs.items(): image = os.path.join(image_dir, glr_plot) if os.path.exists(image): c_thumb = os.path.join(image_dir, glr_plot[:-4] + '_carousel.png') sphinx_gallery.gen_rst.scale_image(image, c_thumb, max_width, 190) # Config for sphinx_issues issues_uri = 'https://github.com/scikit-learn/scikit-learn/issues/{issue}' issues_github_path = 'scikit-learn/scikit-learn' issues_user_uri = 'https://github.com/{user}' def setup(app): # to hide/show the prompt in code examples: app.add_javascript('js/copybutton.js') app.connect('build-finished', make_carousel_thumbs) # The following is used by sphinx.ext.linkcode to provide links to github linkcode_resolve = make_linkcode_resolve('sklearn', u'https://github.com/scikit-learn/' 'scikit-learn/blob/{revision}/' '{package}/{path}#L{lineno}') scikit-learn-0.19.1/doc/data_transforms.rst000066400000000000000000000024111317344356400206650ustar00rootroot00000000000000.. include:: includes/big_toc_css.rst .. _data-transforms: Dataset transformations ----------------------- scikit-learn provides a library of transformers, which may clean (see :ref:`preprocessing`), reduce (see :ref:`data_reduction`), expand (see :ref:`kernel_approximation`) or generate (see :ref:`feature_extraction`) feature representations. Like other estimators, these are represented by classes with a ``fit`` method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a ``transform`` method which applies this transformation model to unseen data. 
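A minimal sketch of this pattern, using ``StandardScaler`` purely for
illustration (``X_train`` and ``X_test`` stand for pre-existing training
and test feature arrays)::

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(X_train)    # learn mean and std from the training set
    X_test_scaled = scaler.transform(X_test)  # apply them to unseen data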
Combining such transformers, either in parallel or in series, is covered in
:ref:`combining_estimators`. :ref:`metrics` covers transforming feature
spaces into affinity matrices, while :ref:`preprocessing_targets` considers
transformations of the target space (e.g. categorical labels) for use in
scikit-learn.

.. toctree::

    modules/pipeline
    modules/feature_extraction
    modules/preprocessing
    modules/unsupervised_reduction
    modules/random_projection
    modules/kernel_approximation
    modules/metrics
    modules/preprocessing_targets

scikit-learn-0.19.1/doc/datasets/
scikit-learn-0.19.1/doc/datasets/conftest.py

from os.path import exists
from os.path import join

import numpy as np

from sklearn.utils.testing import SkipTest
from sklearn.utils.testing import check_skip_network
from sklearn.datasets import get_data_home
from sklearn.utils.testing import install_mldata_mock
from sklearn.utils.testing import uninstall_mldata_mock


def setup_labeled_faces():
    data_home = get_data_home()
    if not exists(join(data_home, 'lfw_home')):
        raise SkipTest("Skipping dataset loading doctests")


def setup_mldata():
    # setup mock urllib2 module to avoid downloading from mldata.org
    install_mldata_mock({
        'mnist-original': {
            'data': np.empty((70000, 784)),
            'label': np.repeat(np.arange(10, dtype='d'), 7000),
        },
        'iris': {
            'data': np.empty((150, 4)),
        },
        'datasets-uci-iris': {
            'double0': np.empty((150, 4)),
            'class': np.empty((150,)),
        },
    })


def teardown_mldata():
    uninstall_mldata_mock()


def setup_rcv1():
    check_skip_network()
    # skip the test in rcv1.rst if the dataset is not already loaded
    rcv1_dir = join(get_data_home(), "RCV1")
    if not exists(rcv1_dir):
        raise SkipTest("Download RCV1 dataset to run this test.")


def setup_twenty_newsgroups():
    data_home = get_data_home()
    if not exists(join(data_home, '20news_home')):
        raise SkipTest("Skipping dataset loading doctests")


def setup_working_with_text_data():
    check_skip_network()


def pytest_runtest_setup(item):
    fname = item.fspath.strpath
    if fname.endswith('datasets/labeled_faces.rst'):
        setup_labeled_faces()
    elif fname.endswith('datasets/mldata.rst'):
        setup_mldata()
    elif fname.endswith('datasets/rcv1.rst'):
        setup_rcv1()
    elif fname.endswith('datasets/twenty_newsgroups.rst'):
        setup_twenty_newsgroups()
    elif fname.endswith('datasets/working_with_text_data.rst'):
        setup_working_with_text_data()


def pytest_runtest_teardown(item):
    fname = item.fspath.strpath
    if fname.endswith('datasets/mldata.rst'):
        teardown_mldata()

scikit-learn-0.19.1/doc/datasets/covtype.rst

.. _covtype:

Forest covertypes
=================

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type, i.e. the
dominant species of tree. There are seven covertypes, making this a multiclass
classification problem. Each sample has 54 features, described on the
`dataset's homepage <http://archive.ics.uci.edu/ml/datasets/Covertype>`_.
Some of the features are boolean indicators, while others are discrete or
continuous measurements.

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset; it
returns a dictionary-like object with the feature matrix in the ``data``
member and the target values in ``target``. The dataset will be downloaded
from the web if necessary.
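A minimal usage sketch (``fetch_covtype`` is the real loader; the shapes shown
correspond to the full dataset and are indicative, and the calls are skipped
in doctests to avoid network access)::

    >>> from sklearn.datasets import fetch_covtype
    >>> covtype = fetch_covtype()  # doctest: +SKIP
    >>> covtype.data.shape         # doctest: +SKIP
    (581012, 54)
    >>> covtype.target.shape       # doctest: +SKIP
    (581012,)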
scikit-learn-0.19.1/doc/datasets/index.rst

.. _datasets:

=========================
Dataset loading utilities
=========================

.. currentmodule:: sklearn.datasets

The ``sklearn.datasets`` package embeds some small toy datasets as introduced
in the :ref:`Getting Started <loading_example_dataset>` section.

To evaluate the impact of the scale of the dataset (``n_samples`` and
``n_features``) while controlling the statistical properties of the data
(typically the correlation and informativeness of the features), it is also
possible to generate synthetic data.

This package also features helpers to fetch larger datasets commonly used by
the machine learning community to benchmark algorithms on data that comes
from the 'real world'.

General dataset API
===================

There are three distinct kinds of dataset interfaces for different types of
datasets.

The simplest one is the interface for sample images, which is described below
in the :ref:`sample_images` section.

The dataset generation functions and the svmlight loader share a simplistic
interface, returning a tuple ``(X, y)`` consisting of an ``n_samples`` *
``n_features`` numpy array ``X`` and an array of length ``n_samples``
containing the targets ``y``.

The toy datasets as well as the 'real world' datasets and the datasets fetched
from mldata.org have more sophisticated structure. These functions return a
dictionary-like object holding at least two items: an array of shape
``n_samples`` * ``n_features`` with key ``data`` (except for 20newsgroups)
and a numpy array of length ``n_samples``, containing the target values,
with key ``target``.

The datasets also contain a description in ``DESCR`` and some contain
``feature_names`` and ``target_names``. See the dataset descriptions below for
details.

Toy datasets
============

scikit-learn comes with a few small standard datasets that do not require
downloading any file from an external website.

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   load_boston
   load_iris
   load_diabetes
   load_digits
   load_linnerud
   load_wine
   load_breast_cancer

These datasets are useful to quickly illustrate the behavior of the various
algorithms implemented in scikit-learn. They are however often too small to be
representative of real world machine learning tasks.

.. _sample_images:

Sample images
=============

scikit-learn also embeds a couple of sample JPEG images published under
Creative Commons license by their authors. Those images can be useful to test
algorithms and pipelines on 2D data.

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   load_sample_images
   load_sample_image

.. image:: ../auto_examples/cluster/images/sphx_glr_plot_color_quantization_001.png
   :target: ../auto_examples/cluster/plot_color_quantization.html
   :scale: 30
   :align: right

.. warning::

  The default coding of images is based on the ``uint8`` dtype to spare
  memory. Often machine learning algorithms work best if the input is
  converted to a floating point representation first. Also, if you plan to
  use ``matplotlib.pyplot.imshow`` don't forget to scale to the range 0 - 1
  as done in the following example.

.. topic:: Examples:

    * :ref:`sphx_glr_auto_examples_cluster_plot_color_quantization.py`

.. _sample_generators:

Sample generators
=================

In addition, scikit-learn includes various random sample generators that can
be used to build artificial datasets of controlled size and complexity.
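For instance, a minimal sketch of drawing a small classification problem in a
single call (the parameter values here are illustrative, not prescriptive)::

    >>> from sklearn.datasets import make_classification
    >>> X, y = make_classification(n_samples=100, n_features=20,
    ...                            n_informative=2, n_redundant=2,
    ...                            random_state=0)
    >>> X.shape, y.shape
    ((100, 20), (100,))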
Generators for classification and clustering
--------------------------------------------

These generators produce a matrix of features and corresponding discrete
targets.

Single label
~~~~~~~~~~~~

Both :func:`make_blobs` and :func:`make_classification` create multiclass
datasets by allocating each class one or more normally-distributed clusters
of points.  :func:`make_blobs` provides greater control regarding the centers
and standard deviations of each cluster, and is used to demonstrate
clustering.  :func:`make_classification` specialises in introducing noise by
way of: correlated, redundant and uninformative features; multiple Gaussian
clusters per class; and linear transformations of the feature space.

:func:`make_gaussian_quantiles` divides a single Gaussian cluster into
near-equal-size classes separated by concentric hyperspheres.
:func:`make_hastie_10_2` generates a similar binary, 10-dimensional problem.

.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_dataset_001.png
   :target: ../auto_examples/datasets/plot_random_dataset.html
   :scale: 50
   :align: center

:func:`make_circles` and :func:`make_moons` generate 2d binary classification
datasets that are challenging to certain algorithms (e.g. centroid-based
clustering or linear classification), including optional Gaussian noise. They
are useful for visualisation. :func:`make_circles` produces Gaussian data with
a spherical decision boundary for binary classification.

Multilabel
~~~~~~~~~~

:func:`make_multilabel_classification` generates random samples with multiple
labels, reflecting a bag of words drawn from a mixture of topics. The number
of topics for each document is drawn from a Poisson distribution, and the
topics themselves are drawn from a fixed random distribution. Similarly, the
number of words is drawn from Poisson, with words drawn from a multinomial,
where each topic defines a probability distribution over words.
Simplifications with respect to true bag-of-words mixtures include:

* Per-topic word distributions are independently drawn, where in reality all
  would be affected by a sparse base distribution, and would be correlated.
* For a document generated from multiple topics, all topics are weighted
  equally in generating its bag of words.
* Documents without labels draw their words at random, rather than from a
  base distribution.

.. image:: ../auto_examples/datasets/images/sphx_glr_plot_random_multilabel_dataset_001.png
   :target: ../auto_examples/datasets/plot_random_multilabel_dataset.html
   :scale: 50
   :align: center

Biclustering
~~~~~~~~~~~~

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_biclusters
   make_checkerboard

Generators for regression
-------------------------

:func:`make_regression` produces regression targets as an optionally-sparse
random linear combination of random features, with noise. Its informative
features may be uncorrelated, or low rank (few features account for most of
the variance).

Other regression generators generate functions deterministically from
randomized features.  :func:`make_sparse_uncorrelated` produces a target as a
linear combination of four features with fixed coefficients.  Others encode
explicitly non-linear relations: :func:`make_friedman1` is related by
polynomial and sine transforms; :func:`make_friedman2` includes feature
multiplication and reciprocation; and :func:`make_friedman3` is similar with
an arctan transformation on the target.
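As a rough sketch of the regression case (again, the parameter values are
only illustrative)::

    >>> from sklearn.datasets import make_regression
    >>> X, y = make_regression(n_samples=50, n_features=4, n_informative=2,
    ...                        noise=0.1, random_state=0)
    >>> X.shape, y.shape
    ((50, 4), (50,))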
Generators for manifold learning
--------------------------------

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_s_curve
   make_swiss_roll

Generators for decomposition
----------------------------

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_low_rank_matrix
   make_sparse_coded_signal
   make_spd_matrix
   make_sparse_spd_matrix

.. _libsvm_loader:

Datasets in svmlight / libsvm format
====================================

scikit-learn includes utility functions for loading datasets in the
svmlight / libsvm format. In this format, each line takes the form ``