pax_global_header00006660000000000000000000000064126170030750014513gustar00rootroot0000000000000052 comment=f9a215ca3aac945ed2aa290d6e50d913e33305fc scikit-learn-0.17.0/000077500000000000000000000000001261700307500141655ustar00rootroot00000000000000scikit-learn-0.17.0/.coveragerc000066400000000000000000000001761261700307500163120ustar00rootroot00000000000000[run] branch = True source = sklearn include = */sklearn/* omit = */sklearn/externals/* */benchmarks/* */setup.py scikit-learn-0.17.0/.gitattributes000066400000000000000000000020651261700307500170630ustar00rootroot00000000000000/sklearn/__check_build/_check_build.c -diff /sklearn/_isotonic.c -diff /sklearn/cluster/_dbscan_inner.cpp -diff /sklearn/cluster/_hierarchical.cpp -diff /sklearn/cluster/_k_means.c -diff /sklearn/datasets/_svmlight_format.c -diff /sklearn/decomposition/_online_lda.c -diff /sklearn/decomposition/cdnmf_fast.c -diff /sklearn/ensemble/_gradient_boosting.c -diff /sklearn/feature_extraction/_hashing.c -diff /sklearn/linear_model/cd_fast.c -diff /sklearn/linear_model/sgd_fast.c -diff /sklearn/linear_model/sag_fast.c -diff /sklearn/metrics/pairwise_fast.c -diff /sklearn/neighbors/ball_tree.c -diff /sklearn/neighbors/kd_tree.c -diff /sklearn/svm/liblinear.c -diff /sklearn/svm/libsvm.c -diff /sklearn/svm/libsvm_sparse.c -diff /sklearn/tree/_tree.c -diff /sklearn/tree/_utils.c -diff /sklearn/utils/arrayfuncs.c -diff /sklearn/utils/graph_shortest_path.c -diff /sklearn/utils/lgamma.c -diff /sklearn/utils/_logistic_sigmoid.c -diff /sklearn/utils/murmurhash.c -diff /sklearn/utils/seq_dataset.c -diff /sklearn/utils/sparsefuncs_fast.c -diff /sklearn/utils/weight_vector.c -diff scikit-learn-0.17.0/.gitignore000066400000000000000000000011421261700307500161530ustar00rootroot00000000000000*.pyc *.so *.pyd *~ .#* *.lprof *.swp *.swo .DS_Store build sklearn/datasets/__config__.py sklearn/**/*.html dist/ MANIFEST doc/_build/ doc/auto_examples/ doc/modules/generated/ doc/datasets/generated/ *.pdf pip-log.txt scikit_learn.egg-info/ .coverage coverage *.py,cover tags covtype.data.gz 20news-18828/ 20news-18828.tar.gz coverages.zip samples.zip doc/coverages.zip doc/samples.zip coverages samples doc/coverages doc/samples *.prof .tox/ .coverage lfw_preprocessed/ nips2010_pdf/ *.nt.bz2 *.tar.gz *.tgz examples/cluster/joblib reuters/ benchmarks/bench_covertype_data/ *.prefs .pydevproject .idea scikit-learn-0.17.0/.landscape.yml000066400000000000000000000001261261700307500167170ustar00rootroot00000000000000pylint: disable: - unpacking-non-sequence ignore-paths: - sklearn/externals scikit-learn-0.17.0/.mailmap000066400000000000000000000151041261700307500156070ustar00rootroot00000000000000Alexandre Gramfort Alexandre Gramfort Alexandre Gramfort Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Arnaud Joly Arnaud Joly Arnaud Joly Anne-Laure Fouque Ariel Rokem arokem Bala Subrahmanyam Varanasi Bertrand Thirion Brandyn A. 
White Brian Cheung Brian Cheung Brian Cheung Brian Holt Christian Osendorfer Clay Woolam Danny Sullivan Denis Engemann Denis Engemann Denis Engemann Denis Engemann Diego Molla DraXus draxus Edouard DUCHESNAY Edouard DUCHESNAY Edouard DUCHESNAY Emmanuelle Gouillart Emmanuelle Gouillart Eustache Diemert Fabian Pedregosa Fabian Pedregosa Fabian Pedregosa Federico Vaggi Federico Vaggi Gael Varoquaux Gael Varoquaux Gael Varoquaux Gilles Louppe Hamzeh Alsalhi <93hamsal@gmail.com> Harikrishnan S Hendrik Heuer Hrishikesh Huilgolkar Immanuel Bayer Jacob Schreiber Jacob Schreiber Jake VanderPlas Jake VanderPlas Jake VanderPlas James Bergstra Jaques Grobler Jan Schlüter Jean Kossaifi Jean Kossaifi Jean Kossaifi Joel Nothman Kyle Kastner Lars Buitinck Lars Buitinck Lars Buitinck Lars Buitinck Lars Buitinck Loic Esteve Manoj Kumar Matthieu Perrot Maheshakya Wijewardena Michael Bommarito Michael Eickenberg Michael Eickenberg Samuel Charron Sergio Medina Nelle Varoquaux Nelle Varoquaux Nelle Varoquaux Nicolas Pinto Noel Dawe Noel Dawe Olivier Grisel Olivier Grisel Olivier Hervieu Paul Butler Peter Prettenhofer Robert Layton Roman Sinayev Roman Sinayev Ronald Phlypo Satrajit Ghosh Shiqiao Du Shiqiao Du Thomas Unterthiner Tim Sheerman-Chase Vincent Dubourg Vincent Dubourg Vincent Michel Vincent Michel Vincent Michel Vincent Michel Vincent Michel Vincent Schut Virgile Fritsch Virgile Fritsch Vlad Niculae Wei Li Wei Li X006 Xinfan Meng Yannick Schwartz Yannick Schwartz Yannick Schwartz scikit-learn-0.17.0/.travis.yml000066400000000000000000000027521261700307500163040ustar00rootroot00000000000000language: python # make it explicit that we favor the new container-based travis workers sudo: false addons: apt: packages: # Only used by the DISTRIB="ubuntu" setting - libatlas3gf-base - libatlas-dev - python-scipy env: matrix: # This environment tests that scikit-learn can be built against # versions of numpy, scipy with ATLAS that comes with Ubuntu Precise 12.04 - DISTRIB="ubuntu" PYTHON_VERSION="2.7" COVERAGE="true" # This environment tests the oldest supported anaconda env - DISTRIB="conda" PYTHON_VERSION="2.6" INSTALL_MKL="false" NUMPY_VERSION="1.6.2" SCIPY_VERSION="0.11.0" # This environment tests the newest supported anaconda env - DISTRIB="conda" PYTHON_VERSION="3.5" INSTALL_MKL="true" NUMPY_VERSION="1.10.1" SCIPY_VERSION="0.16.0" install: source continuous_integration/install.sh script: bash continuous_integration/test_script.sh after_success: # Ignore coveralls failures as the coveralls server is not very reliable # but we don't want travis to report a failure in the github UI just # because the coverage report failed to be published. - if [[ "$COVERAGE" == "true" ]]; then coveralls || echo "failed"; fi cache: apt notifications: webhooks: urls: - https://webhooks.gitter.im/e/4ffabb4df010b70cd624 on_success: change # options: [always|never|change] default: always on_failure: always # options: [always|never|change] default: always on_start: false # default: false scikit-learn-0.17.0/AUTHORS.rst000066400000000000000000000041301261700307500160420ustar00rootroot00000000000000.. -*- mode: rst -*- This is a community effort, and as such many people have contributed to it over the years. History ------- This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu Brucher started work on this project as part of his thesis. 
In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the project and made the first public release, February the 1st 2010. Since then, several releases have appeared following a ~3 month cycle, and a thriving international community has been leading the development. People ------ .. hlist:: * David Cournapeau * Jarrod Millman * `Matthieu Brucher `_ * `Fabian Pedregosa `_ * `Gael Varoquaux `_ * `Jake VanderPlas `_ * `Alexandre Gramfort `_ * `Olivier Grisel `_ * Bertrand Thirion * Vincent Michel * Chris Filo Gorgolewski * `Angel Soler Gollonet `_ * `Yaroslav Halchenko `_ * Ron Weiss * `Virgile Fritsch `_ * `Mathieu Blondel `_ * `Peter Prettenhofer `_ * Vincent Dubourg * `Alexandre Passos `_ * `Vlad Niculae `_ * Edouard Duchesnay * Thouis (Ray) Jones * Lars Buitinck * Paolo Losi * Nelle Varoquaux * `Brian Holt `_ * Robert Layton * `Gilles Louppe `_ * `Andreas Müller `_ (release manager) * `Satra Ghosh `_ * `Wei Li `_ * `Arnaud Joly `_ * `Kemal Eren `_ * `Michael Becker `_ scikit-learn-0.17.0/CONTRIBUTING.md000066400000000000000000000124721261700307500164240ustar00rootroot00000000000000 Contributing code ================= **Note: This document is just to get started, visit [**Contributing page**](http://scikit-learn.org/stable/developers/index.html) for the full contributor's guide. Please be sure to read it carefully to make the code review process go as smoothly as possible and maximize the likelihood of your contribution being merged.** How to contribute ----------------- The preferred way to contribute to scikit-learn is to fork the [main repository](http://github.com/scikit-learn/scikit-learn/) on GitHub: 1. Fork the [project repository](http://github.com/scikit-learn/scikit-learn): click on the 'Fork' button near the top of the page. This creates a copy of the code under your account on the GitHub server. 2. Clone this copy to your local disk: $ git clone git@github.com:YourLogin/scikit-learn.git $ cd scikit-learn 3. Create a branch to hold your changes: $ git checkout -b my-feature and start making changes. Never work in the ``master`` branch! 4. Work on this copy on your computer using Git to do the version control. When you're done editing, do: $ git add modified_files $ git commit to record your changes in Git, then push them to GitHub with: $ git push -u origin my-feature Finally, go to the web page of your fork of the scikit-learn repo, and click 'Pull request' to send your changes to the maintainers for review. This will send an email to the committers. (If any of the above seems like magic to you, then look up the [Git documentation](http://git-scm.com/documentation) on the web.) It is recommended to check that your contribution complies with the following rules before submitting a pull request: - All public methods should have informative docstrings with sample usage presented as doctests when appropriate. - All other tests pass when everything is rebuilt from scratch. On Unix-like systems, check with (from the toplevel source folder): $ make - When adding additional functionality, provide at least one example script in the ``examples/`` folder. Have a look at other examples for reference. Examples should demonstrate why the new functionality is useful in practice and, if possible, compare it to other methods available in scikit-learn. - At least one paragraph of narrative documentation with links to references in the literature (with PDF links when possible) and the example. 
The documentation should also include expected time and space complexity of the algorithm and scalability, e.g. "this algorithm can scale to a large number of samples > 100000, but does not scale in dimensionality: n_features is expected to be lower than 100". You can also check for common programming errors with the following tools: - Code with good unittest coverage (at least 80%), check with: $ pip install nose coverage $ nosetests --with-coverage path/to/tests_for_package - No pyflakes warnings, check with: $ pip install pyflakes $ pyflakes path/to/module.py - No PEP8 warnings, check with: $ pip install pep8 $ pep8 path/to/module.py - AutoPEP8 can help you fix some of the easy redundant errors: $ pip install autopep8 $ autopep8 path/to/pep8.py Bonus points for contributions that include a performance analysis with a benchmark script and profiling output (please report on the mailing list or on the GitHub issue). Easy Issues ----------- A great way to start contributing to scikit-learn is to pick an item from the list of [Easy issues](https://github.com/scikit-learn/scikit-learn/issues?labels=Easy) in the issue tracker. Resolving these issues allow you to start contributing to the project without much prior knowledge. Your assistance in this area will be greatly appreciated by the more experienced developers as it helps free up their time to concentrate on other issues. Documentation ------------- We are glad to accept any sort of documentation: function docstrings, reStructuredText documents (like this one), tutorials, etc. reStructuredText documents live in the source code repository under the doc/ directory. You can edit the documentation using any text editor and then generate the HTML output by typing ``make html`` from the doc/ directory. Alternatively, ``make`` can be used to quickly generate the documentation without the example gallery. The resulting HTML files will be placed in _build/html/ and are viewable in a web browser. See the README file in the doc/ directory for more information. For building the documentation, you will need [sphinx](http://sphinx.pocoo.org/), [matplotlib](http://matplotlib.sourceforge.net/), and [pillow](http://pillow.readthedocs.org/en/latest/). When you are writing documentation, it is important to keep a good compromise between mathematical and algorithmic details, and give intuition to the reader on what the algorithm does. It is best to always start with a small paragraph with a hand-waving explanation of what the method does to the data and a figure (coming from an example) illustrating it. Further Information ------------------- Visit the [Contributing Code](http://scikit-learn.org/stable/developers/index.html#coding-guidelines) section of the website for more information including conforming to the API spec and profiling contributed code. scikit-learn-0.17.0/COPYING000066400000000000000000000030271261700307500152220ustar00rootroot00000000000000New BSD License Copyright (c) 2007–2015 The scikit-learn developers. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: a. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. b. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. c. 
Neither the name of the Scikit-learn Developers nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

scikit-learn-0.17.0/MANIFEST.in

include *.rst
recursive-include doc *
recursive-include examples *
recursive-include sklearn *.c *.h *.pyx *.pxd *.pxi
recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt
include COPYING
include AUTHORS.rst
include README.rst

scikit-learn-0.17.0/Makefile

# simple makefile to simplify repetitive build env management tasks under posix

# caution: testing won't work on windows, see README

PYTHON ?= python
CYTHON ?= cython
NOSETESTS ?= nosetests
CTAGS ?= ctags

# skip doctests on 32bit python
BITS := $(shell python -c 'import struct; print(8 * struct.calcsize("P"))')

ifeq ($(BITS),32)
  NOSETESTS:=$(NOSETESTS) -c setup32.cfg
endif

all: clean inplace test

clean-ctags:
	rm -f tags

clean: clean-ctags
	$(PYTHON) setup.py clean
	rm -rf dist

in: inplace # just a shortcut
inplace:
	# to avoid errors in 0.15 upgrade
	rm -f sklearn/utils/sparsefuncs*.so
	rm -f sklearn/utils/random*.so
	$(PYTHON) setup.py build_ext -i

test-code: in
	$(NOSETESTS) -s -v sklearn
test-sphinxext:
	$(NOSETESTS) -s -v doc/sphinxext/
test-doc:
ifeq ($(BITS),64)
	$(NOSETESTS) -s -v doc/*.rst doc/modules/ doc/datasets/ \
	doc/developers doc/tutorial/basic doc/tutorial/statistical_inference \
	doc/tutorial/text_analytics
endif

test-coverage:
	rm -rf coverage .coverage
	$(NOSETESTS) -s -v --with-coverage sklearn

test: test-code test-sphinxext test-doc

trailing-spaces:
	find sklearn -name "*.py" -exec perl -pi -e 's/[ \t]*$$//' {} \;

cython:
	find sklearn -name "*.pyx" -exec $(CYTHON) {} \;

ctags:
	# make tags for symbol based navigation in emacs and vim
	# Install with: sudo apt-get install exuberant-ctags
	$(CTAGS) --python-kinds=-i -R sklearn

doc: inplace
	$(MAKE) -C doc html

doc-noplot: inplace
	$(MAKE) -C doc html-noplot

code-analysis:
	flake8 sklearn | grep -v __init__ | grep -v external
	pylint -E -i y sklearn/ -d E1103,E0611,E1101

scikit-learn-0.17.0/README.rst

.. -*- mode: rst -*-

|Travis|_ |AppVeyor|_ |Coveralls|_

.. |Travis| image:: https://api.travis-ci.org/scikit-learn/scikit-learn.png?branch=master
.. _Travis: https://travis-ci.org/scikit-learn/scikit-learn

.. |AppVeyor| image:: https://ci.appveyor.com/api/projects/status/github/scikit-learn/scikit-learn?branch=master&svg=true
.. _AppVeyor: https://ci.appveyor.com/project/sklearn-ci/scikit-learn/history

.. |Coveralls| image:: https://coveralls.io/repos/scikit-learn/scikit-learn/badge.svg?branch=master
.. _Coveralls: https://coveralls.io/r/scikit-learn/scikit-learn

scikit-learn
============

scikit-learn is a Python module for machine learning built on top of
SciPy and distributed under the 3-Clause BSD license.

The project was started in 2007 by David Cournapeau as a Google Summer
of Code project, and since then many volunteers have contributed. See
the AUTHORS.rst file for a complete list of contributors.

It is currently maintained by a team of volunteers.

**Note** `scikit-learn` was previously referred to as `scikits.learn`.


Important links
===============

- Official source code repo: https://github.com/scikit-learn/scikit-learn
- HTML documentation (stable release): http://scikit-learn.org
- HTML documentation (development version): http://scikit-learn.org/dev/
- Download releases: http://sourceforge.net/projects/scikit-learn/files/
- Issue tracker: https://github.com/scikit-learn/scikit-learn/issues
- Mailing list: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
- IRC channel: ``#scikit-learn`` at ``irc.freenode.net``


Dependencies
============

scikit-learn is tested to work under Python 2.6, Python 2.7, and Python 3.4
(using the same codebase thanks to an embedded copy of `six`_). It should
also work with Python 3.3.

The required dependencies to build the software are NumPy >= 1.6.1,
SciPy >= 0.9 and a working C/C++ compiler.

For running the examples Matplotlib >= 1.1.1 is required and for running the
tests you need nose >= 1.1.2. This configuration matches the Ubuntu Precise
12.04 LTS release from April 2012.

scikit-learn also uses CBLAS, the C interface to the Basic Linear Algebra
Subprograms library. scikit-learn comes with a reference implementation, but
the system CBLAS will be detected by the build system and used if present.
CBLAS exists in many implementations; see `Linear algebra libraries`_
for known issues.


Install
=======

This package uses distutils, which is the default way of installing
python modules. To install in your home directory, use::

  python setup.py install --user

To install for all users on Unix/Linux::

  python setup.py build
  sudo python setup.py install

For more detailed installation instructions, see the web page
http://scikit-learn.org/stable/install.html


Development
===========

Code
----

GIT
~~~

You can check the latest sources with the command::

    git clone https://github.com/scikit-learn/scikit-learn.git

or if you have write privileges::

    git clone git@github.com:scikit-learn/scikit-learn.git

Contributing
~~~~~~~~~~~~

Quick tutorial on how to go about setting up your environment to
contribute to scikit-learn:
https://github.com/scikit-learn/scikit-learn/blob/master/CONTRIBUTING.md

Before opening a Pull Request, have a look at the full Contributing page to
make sure your code complies with our guidelines:
http://scikit-learn.org/stable/developers/index.html


Testing
-------

After installation, you can launch the test suite from outside the
source directory (you will need to have the ``nose`` package installed)::

    $ nosetests -v sklearn

Under Windows, it is recommended to use the following command (adjust the path
to the ``python.exe`` program) as using the ``nosetests.exe`` program can badly
interact with tests that use ``multiprocessing``::

    C:\Python34\python.exe -c "import nose; nose.main()" -v sklearn

See the web page http://scikit-learn.org/stable/install.html#testing
for more information.
Random number generation can be controlled during testing by setting the ``SKLEARN_SEED`` environment variable. scikit-learn-0.17.0/appveyor.yml000066400000000000000000000063701261700307500165630ustar00rootroot00000000000000# AppVeyor.com is a Continuous Integration service to build and run tests under # Windows # https://ci.appveyor.com/project/sklearn-ci/scikit-learn environment: global: # SDK v7.0 MSVC Express 2008's SetEnv.cmd script will fail if the # /E:ON and /V:ON options are not enabled in the batch script intepreter # See: http://stackoverflow.com/a/13751649/163740 CMD_IN_ENV: "cmd /E:ON /V:ON /C .\\continuous_integration\\appveyor\\run_with_env.cmd" WHEELHOUSE_UPLOADER_USERNAME: sklearn-appveyor WHEELHOUSE_UPLOADER_SECRET: secure: BQm8KfEj6v2Y+dQxb2syQvTFxDnHXvaNktkLcYSq7jfbTOO6eH9n09tfQzFUVcWZ # Make sure we don't download large datasets when running the test on # continuous integration platform SKLEARN_SKIP_NETWORK_TESTS: 1 matrix: - PYTHON: "C:\\Python27" PYTHON_VERSION: "2.7.8" PYTHON_ARCH: "32" - PYTHON: "C:\\Python27-x64" PYTHON_VERSION: "2.7.8" PYTHON_ARCH: "64" - PYTHON: "C:\\Python34" PYTHON_VERSION: "3.4.3" PYTHON_ARCH: "32" - PYTHON: "C:\\Python34-x64" PYTHON_VERSION: "3.4.3" PYTHON_ARCH: "64" - PYTHON: "C:\\Python35" PYTHON_VERSION: "3.5.0" PYTHON_ARCH: "32" - PYTHON: "C:\\Python35-x64" PYTHON_VERSION: "3.5.0" PYTHON_ARCH: "64" install: # Install Python (from the official .msi of http://python.org) and pip when # not already installed. - "powershell ./continuous_integration/appveyor/install.ps1" - "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%" # Check that we have the expected version and architecture for Python - "python --version" - "python -c \"import struct; print(struct.calcsize('P') * 8)\"" # Install the build and runtime dependencies of the project. - "%CMD_IN_ENV% pip install --timeout=60 --trusted-host 28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com -r continuous_integration/appveyor/requirements.txt" - "%CMD_IN_ENV% python setup.py bdist_wheel bdist_wininst -b doc/logos/scikit-learn-logo.bmp" - ps: "ls dist" # Install the genreated wheel package to test it - "pip install --pre --no-index --find-links dist/ scikit-learn" # Not a .NET project, we build scikit-learn in the install step instead build: false test_script: # Change to a non-source folder to make sure we run the tests on the # installed library. - "mkdir empty_folder" - "cd empty_folder" - "python -c \"import nose; nose.main()\" -s -v sklearn" # Move back to the project folder - "cd .." artifacts: # Archive the generated wheel package in the ci.appveyor.com build report. - path: dist\* on_success: # Upload the generated wheel package to Rackspace # On Windows, Apache Libcloud cannot find a standard CA cert bundle so we # disable the ssl checks. - "python -m wheelhouse_uploader upload --no-ssl-check --local-folder=dist sklearn-windows-wheels" notifications: - provider: Webhook url: https://webhooks.gitter.im/e/0dc8e57cd38105aeb1b4 on_build_success: false on_build_failure: True cache: # Use the appveyor cache to avoid re-downloading large archives such # the MKL numpy and scipy wheels mirrored on a rackspace cloud # container, speed up the appveyor jobs and reduce bandwidth # usage on our rackspace account. 
- '%APPDATA%\pip\Cache' scikit-learn-0.17.0/benchmarks/000077500000000000000000000000001261700307500163025ustar00rootroot00000000000000scikit-learn-0.17.0/benchmarks/bench_20newsgroups.py000066400000000000000000000067431261700307500224030ustar00rootroot00000000000000from __future__ import print_function, division from time import time import argparse import numpy as np from sklearn.dummy import DummyClassifier from sklearn.datasets import fetch_20newsgroups_vectorized from sklearn.metrics import accuracy_score from sklearn.utils.validation import check_array from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import ExtraTreesClassifier from sklearn.ensemble import AdaBoostClassifier from sklearn.linear_model import LogisticRegression from sklearn.naive_bayes import MultinomialNB ESTIMATORS = { "dummy": DummyClassifier(), "random_forest": RandomForestClassifier(n_estimators=100, max_features="sqrt", min_samples_split=10), "extra_trees": ExtraTreesClassifier(n_estimators=100, max_features="sqrt", min_samples_split=10), "logistic_regression": LogisticRegression(), "naive_bayes": MultinomialNB(), "adaboost": AdaBoostClassifier(n_estimators=10), } ############################################################################### # Data if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('-e', '--estimators', nargs="+", required=True, choices=ESTIMATORS) args = vars(parser.parse_args()) data_train = fetch_20newsgroups_vectorized(subset="train") data_test = fetch_20newsgroups_vectorized(subset="test") X_train = check_array(data_train.data, dtype=np.float32, accept_sparse="csc") X_test = check_array(data_test.data, dtype=np.float32, accept_sparse="csr") y_train = data_train.target y_test = data_test.target print("20 newsgroups") print("=============") print("X_train.shape = {0}".format(X_train.shape)) print("X_train.format = {0}".format(X_train.format)) print("X_train.dtype = {0}".format(X_train.dtype)) print("X_train density = {0}" "".format(X_train.nnz / np.product(X_train.shape))) print("y_train {0}".format(y_train.shape)) print("X_test {0}".format(X_test.shape)) print("X_test.format = {0}".format(X_test.format)) print("X_test.dtype = {0}".format(X_test.dtype)) print("y_test {0}".format(y_test.shape)) print() print("Classifier Training") print("===================") accuracy, train_time, test_time = {}, {}, {} for name in sorted(args["estimators"]): clf = ESTIMATORS[name] try: clf.set_params(random_state=0) except (TypeError, ValueError): pass print("Training %s ... 
" % name, end="") t0 = time() clf.fit(X_train, y_train) train_time[name] = time() - t0 t0 = time() y_pred = clf.predict(X_test) test_time[name] = time() - t0 accuracy[name] = accuracy_score(y_test, y_pred) print("done") print() print("Classification performance:") print("===========================") print() print("%s %s %s %s" % ("Classifier ", "train-time", "test-time", "Accuracy")) print("-" * 44) for name in sorted(accuracy, key=accuracy.get): print("%s %s %s %s" % (name.ljust(16), ("%.4fs" % train_time[name]).center(10), ("%.4fs" % test_time[name]).center(10), ("%.4f" % accuracy[name]).center(10))) print() scikit-learn-0.17.0/benchmarks/bench_covertype.py000066400000000000000000000163251261700307500220420ustar00rootroot00000000000000""" =========================== Covertype dataset benchmark =========================== Benchmark stochastic gradient descent (SGD), Liblinear, and Naive Bayes, CART (decision tree), RandomForest and Extra-Trees on the forest covertype dataset of Blackard, Jock, and Dean [1]. The dataset comprises 581,012 samples. It is low dimensional with 54 features and a sparsity of approx. 23%. Here, we consider the task of predicting class 1 (spruce/fir). The classification performance of SGD is competitive with Liblinear while being two orders of magnitude faster to train:: [..] Classification performance: =========================== Classifier train-time test-time error-rate -------------------------------------------- liblinear 15.9744s 0.0705s 0.2305 GaussianNB 3.0666s 0.3884s 0.4841 SGD 1.0558s 0.1152s 0.2300 CART 79.4296s 0.0523s 0.0469 RandomForest 1190.1620s 0.5881s 0.0243 ExtraTrees 640.3194s 0.6495s 0.0198 The same task has been used in a number of papers including: * `"SVM Optimization: Inverse Dependence on Training Set Size" `_ S. Shalev-Shwartz, N. Srebro - In Proceedings of ICML '08. * `"Pegasos: Primal estimated sub-gradient solver for svm" `_ S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML '07. * `"Training Linear SVMs in Linear Time" `_ T. 
Joachims - In SIGKDD '06 [1] http://archive.ics.uci.edu/ml/datasets/Covertype """ from __future__ import division, print_function # Author: Peter Prettenhofer # Arnaud Joly # License: BSD 3 clause import os from time import time import argparse import numpy as np from sklearn.datasets import fetch_covtype, get_data_home from sklearn.svm import LinearSVC from sklearn.linear_model import SGDClassifier, LogisticRegression from sklearn.naive_bayes import GaussianNB from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn.metrics import zero_one_loss from sklearn.externals.joblib import Memory from sklearn.utils import check_array # Memoize the data extraction and memory map the resulting # train / test splits in readonly mode memory = Memory(os.path.join(get_data_home(), 'covertype_benchmark_data'), mmap_mode='r') @memory.cache def load_data(dtype=np.float32, order='C', random_state=13): """Load the data, then cache and memmap the train/test split""" ###################################################################### ## Load dataset print("Loading dataset...") data = fetch_covtype(download_if_missing=True, shuffle=True, random_state=random_state) X = check_array(data['data'], dtype=dtype, order=order) y = (data['target'] != 1).astype(np.int) ## Create train-test split (as [Joachims, 2006]) print("Creating train-test split...") n_train = 522911 X_train = X[:n_train] y_train = y[:n_train] X_test = X[n_train:] y_test = y[n_train:] ## Standardize first 10 features (the numerical ones) mean = X_train.mean(axis=0) std = X_train.std(axis=0) mean[10:] = 0.0 std[10:] = 1.0 X_train = (X_train - mean) / std X_test = (X_test - mean) / std return X_train, X_test, y_train, y_test ESTIMATORS = { 'GBRT': GradientBoostingClassifier(n_estimators=250), 'ExtraTrees': ExtraTreesClassifier(n_estimators=20), 'RandomForest': RandomForestClassifier(n_estimators=20), 'CART': DecisionTreeClassifier(min_samples_split=5), 'SGD': SGDClassifier(alpha=0.001, n_iter=2), 'GaussianNB': GaussianNB(), 'liblinear': LinearSVC(loss="l2", penalty="l2", C=1000, dual=False, tol=1e-3), 'SAG': LogisticRegression(solver='sag', max_iter=2, C=1000) } if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('--classifiers', nargs="+", choices=ESTIMATORS, type=str, default=['liblinear', 'GaussianNB', 'SGD', 'CART'], help="list of classifiers to benchmark.") parser.add_argument('--n-jobs', nargs="?", default=1, type=int, help="Number of concurrently running workers for " "models that support parallelism.") parser.add_argument('--order', nargs="?", default="C", type=str, choices=["F", "C"], help="Allow to choose between fortran and C ordered " "data") parser.add_argument('--random-seed', nargs="?", default=13, type=int, help="Common seed used by random number generator.") args = vars(parser.parse_args()) print(__doc__) X_train, X_test, y_train, y_test = load_data( order=args["order"], random_state=args["random_seed"]) print("") print("Dataset statistics:") print("===================") print("%s %d" % ("number of features:".ljust(25), X_train.shape[1])) print("%s %d" % ("number of classes:".ljust(25), np.unique(y_train).size)) print("%s %s" % ("data type:".ljust(25), X_train.dtype)) print("%s %d (pos=%d, neg=%d, size=%dMB)" % ("number of train samples:".ljust(25), X_train.shape[0], np.sum(y_train == 1), np.sum(y_train == 0), int(X_train.nbytes / 1e6))) print("%s %d (pos=%d, neg=%d, 
size=%dMB)" % ("number of test samples:".ljust(25), X_test.shape[0], np.sum(y_test == 1), np.sum(y_test == 0), int(X_test.nbytes / 1e6))) print() print("Training Classifiers") print("====================") error, train_time, test_time = {}, {}, {} for name in sorted(args["classifiers"]): print("Training %s ... " % name, end="") estimator = ESTIMATORS[name] estimator_params = estimator.get_params() estimator.set_params(**{p: args["random_seed"] for p in estimator_params if p.endswith("random_state")}) if "n_jobs" in estimator_params: estimator.set_params(n_jobs=args["n_jobs"]) time_start = time() estimator.fit(X_train, y_train) train_time[name] = time() - time_start time_start = time() y_pred = estimator.predict(X_test) test_time[name] = time() - time_start error[name] = zero_one_loss(y_test, y_pred) print("done") print() print("Classification performance:") print("===========================") print("%s %s %s %s" % ("Classifier ", "train-time", "test-time", "error-rate")) print("-" * 44) for name in sorted(args["classifiers"], key=error.get): print("%s %s %s %s" % (name.ljust(12), ("%.4fs" % train_time[name]).center(10), ("%.4fs" % test_time[name]).center(10), ("%.4f" % error[name]).center(10))) print() scikit-learn-0.17.0/benchmarks/bench_glm.py000066400000000000000000000027251261700307500206000ustar00rootroot00000000000000""" A comparison of different methods in GLM Data comes from a random square matrix. """ from datetime import datetime import numpy as np from sklearn import linear_model from sklearn.utils.bench import total_seconds if __name__ == '__main__': import pylab as pl n_iter = 40 time_ridge = np.empty(n_iter) time_ols = np.empty(n_iter) time_lasso = np.empty(n_iter) dimensions = 500 * np.arange(1, n_iter + 1) for i in range(n_iter): print('Iteration %s of %s' % (i, n_iter)) n_samples, n_features = 10 * i + 3, 10 * i + 3 X = np.random.randn(n_samples, n_features) Y = np.random.randn(n_samples) start = datetime.now() ridge = linear_model.Ridge(alpha=1.) ridge.fit(X, Y) time_ridge[i] = total_seconds(datetime.now() - start) start = datetime.now() ols = linear_model.LinearRegression() ols.fit(X, Y) time_ols[i] = total_seconds(datetime.now() - start) start = datetime.now() lasso = linear_model.LassoLars() lasso.fit(X, Y) time_lasso[i] = total_seconds(datetime.now() - start) pl.figure('scikit-learn GLM benchmark results') pl.xlabel('Dimensions') pl.ylabel('Time (s)') pl.plot(dimensions, time_ridge, color='r') pl.plot(dimensions, time_ols, color='g') pl.plot(dimensions, time_lasso, color='b') pl.legend(['Ridge', 'OLS', 'LassoLars'], loc='upper left') pl.axis('tight') pl.show() scikit-learn-0.17.0/benchmarks/bench_glmnet.py000066400000000000000000000074101261700307500213030ustar00rootroot00000000000000""" To run this, you'll need to have installed. * glmnet-python * scikit-learn (of course) Does two benchmarks First, we fix a training set and increase the number of samples. Then we plot the computation time as function of the number of samples. In the second benchmark, we increase the number of dimensions of the training set. Then we plot the computation time as function of the number of dimensions. In both cases, only 10% of the features are informative. 
""" import numpy as np import gc from time import time from sklearn.datasets.samples_generator import make_regression alpha = 0.1 # alpha = 0.01 def rmse(a, b): return np.sqrt(np.mean((a - b) ** 2)) def bench(factory, X, Y, X_test, Y_test, ref_coef): gc.collect() # start time tstart = time() clf = factory(alpha=alpha).fit(X, Y) delta = (time() - tstart) # stop time print("duration: %0.3fs" % delta) print("rmse: %f" % rmse(Y_test, clf.predict(X_test))) print("mean coef abs diff: %f" % abs(ref_coef - clf.coef_.ravel()).mean()) return delta if __name__ == '__main__': from glmnet.elastic_net import Lasso as GlmnetLasso from sklearn.linear_model import Lasso as ScikitLasso # Delayed import of pylab import pylab as pl scikit_results = [] glmnet_results = [] n = 20 step = 500 n_features = 1000 n_informative = n_features / 10 n_test_samples = 1000 for i in range(1, n + 1): print('==================') print('Iteration %s of %s' % (i, n)) print('==================') X, Y, coef_ = make_regression( n_samples=(i * step) + n_test_samples, n_features=n_features, noise=0.1, n_informative=n_informative, coef=True) X_test = X[-n_test_samples:] Y_test = Y[-n_test_samples:] X = X[:(i * step)] Y = Y[:(i * step)] print("benchmarking scikit-learn: ") scikit_results.append(bench(ScikitLasso, X, Y, X_test, Y_test, coef_)) print("benchmarking glmnet: ") glmnet_results.append(bench(GlmnetLasso, X, Y, X_test, Y_test, coef_)) pl.clf() xx = range(0, n * step, step) pl.title('Lasso regression on sample dataset (%d features)' % n_features) pl.plot(xx, scikit_results, 'b-', label='scikit-learn') pl.plot(xx, glmnet_results, 'r-', label='glmnet') pl.legend() pl.xlabel('number of samples to classify') pl.ylabel('Time (s)') pl.show() # now do a benchmark where the number of points is fixed # and the variable is the number of features scikit_results = [] glmnet_results = [] n = 20 step = 100 n_samples = 500 for i in range(1, n + 1): print('==================') print('Iteration %02d of %02d' % (i, n)) print('==================') n_features = i * step n_informative = n_features / 10 X, Y, coef_ = make_regression( n_samples=(i * step) + n_test_samples, n_features=n_features, noise=0.1, n_informative=n_informative, coef=True) X_test = X[-n_test_samples:] Y_test = Y[-n_test_samples:] X = X[:n_samples] Y = Y[:n_samples] print("benchmarking scikit-learn: ") scikit_results.append(bench(ScikitLasso, X, Y, X_test, Y_test, coef_)) print("benchmarking glmnet: ") glmnet_results.append(bench(GlmnetLasso, X, Y, X_test, Y_test, coef_)) xx = np.arange(100, 100 + n * step, step) pl.figure('scikit-learn vs. glmnet benchmark results') pl.title('Regression in high dimensional spaces (%d samples)' % n_samples) pl.plot(xx, scikit_results, 'b-', label='scikit-learn') pl.plot(xx, glmnet_results, 'r-', label='glmnet') pl.legend() pl.xlabel('number of features') pl.ylabel('Time (s)') pl.axis('tight') pl.show() scikit-learn-0.17.0/benchmarks/bench_isotonic.py000066400000000000000000000057461261700307500216560ustar00rootroot00000000000000""" Benchmarks of isotonic regression performance. We generate a synthetic dataset of size 10^n, for n in [min, max], and examine the time taken to run isotonic regression over the dataset. The timings are then output to stdout, or visualized on a log-log scale with matplotlib. This alows the scaling of the algorithm with the problem size to be visualized and understood. 
""" from __future__ import print_function import numpy as np import gc from datetime import datetime from sklearn.isotonic import isotonic_regression from sklearn.utils.bench import total_seconds import matplotlib.pyplot as plt import argparse def generate_perturbed_logarithm_dataset(size): return np.random.randint(-50, 50, size=n) \ + 50. * np.log(1 + np.arange(n)) def generate_logistic_dataset(size): X = np.sort(np.random.normal(size=size)) return np.random.random(size=size) < 1.0 / (1.0 + np.exp(-X)) DATASET_GENERATORS = { 'perturbed_logarithm': generate_perturbed_logarithm_dataset, 'logistic': generate_logistic_dataset } def bench_isotonic_regression(Y): """ Runs a single iteration of isotonic regression on the input data, and reports the total time taken (in seconds). """ gc.collect() tstart = datetime.now() isotonic_regression(Y) delta = datetime.now() - tstart return total_seconds(delta) if __name__ == '__main__': parser = argparse.ArgumentParser( description="Isotonic Regression benchmark tool") parser.add_argument('--iterations', type=int, required=True, help="Number of iterations to average timings over " "for each problem size") parser.add_argument('--log_min_problem_size', type=int, required=True, help="Base 10 logarithm of the minimum problem size") parser.add_argument('--log_max_problem_size', type=int, required=True, help="Base 10 logarithm of the maximum problem size") parser.add_argument('--show_plot', action='store_true', help="Plot timing output with matplotlib") parser.add_argument('--dataset', choices=DATASET_GENERATORS.keys(), required=True) args = parser.parse_args() timings = [] for exponent in range(args.log_min_problem_size, args.log_max_problem_size): n = 10 ** exponent Y = DATASET_GENERATORS[args.dataset](n) time_per_iteration = \ [bench_isotonic_regression(Y) for i in range(args.iterations)] timing = (n, np.mean(time_per_iteration)) timings.append(timing) # If we're not plotting, dump the timing to stdout if not args.show_plot: print(n, np.mean(time_per_iteration)) if args.show_plot: plt.plot(*zip(*timings)) plt.title("Average time taken running isotonic regression") plt.xlabel('Number of observations') plt.ylabel('Time (s)') plt.axis('tight') plt.loglog() plt.show() scikit-learn-0.17.0/benchmarks/bench_lasso.py000066400000000000000000000063511261700307500211410ustar00rootroot00000000000000""" Benchmarks of Lasso vs LassoLars First, we fix a training set and increase the number of samples. Then we plot the computation time as function of the number of samples. In the second benchmark, we increase the number of dimensions of the training set. Then we plot the computation time as function of the number of dimensions. In both cases, only 10% of the features are informative. 
""" import gc from time import time import numpy as np from sklearn.datasets.samples_generator import make_regression def compute_bench(alpha, n_samples, n_features, precompute): lasso_results = [] lars_lasso_results = [] it = 0 for ns in n_samples: for nf in n_features: it += 1 print('==================') print('Iteration %s of %s' % (it, max(len(n_samples), len(n_features)))) print('==================') n_informative = nf // 10 X, Y, coef_ = make_regression(n_samples=ns, n_features=nf, n_informative=n_informative, noise=0.1, coef=True) X /= np.sqrt(np.sum(X ** 2, axis=0)) # Normalize data gc.collect() print("- benchmarking Lasso") clf = Lasso(alpha=alpha, fit_intercept=False, precompute=precompute) tstart = time() clf.fit(X, Y) lasso_results.append(time() - tstart) gc.collect() print("- benchmarking LassoLars") clf = LassoLars(alpha=alpha, fit_intercept=False, normalize=False, precompute=precompute) tstart = time() clf.fit(X, Y) lars_lasso_results.append(time() - tstart) return lasso_results, lars_lasso_results if __name__ == '__main__': from sklearn.linear_model import Lasso, LassoLars import pylab as pl alpha = 0.01 # regularization parameter n_features = 10 list_n_samples = np.linspace(100, 1000000, 5).astype(np.int) lasso_results, lars_lasso_results = compute_bench(alpha, list_n_samples, [n_features], precompute=True) pl.figure('scikit-learn LASSO benchmark results') pl.subplot(211) pl.plot(list_n_samples, lasso_results, 'b-', label='Lasso') pl.plot(list_n_samples, lars_lasso_results, 'r-', label='LassoLars') pl.title('precomputed Gram matrix, %d features, alpha=%s' % (n_features, alpha)) pl.legend(loc='upper left') pl.xlabel('number of samples') pl.ylabel('Time (s)') pl.axis('tight') n_samples = 2000 list_n_features = np.linspace(500, 3000, 5).astype(np.int) lasso_results, lars_lasso_results = compute_bench(alpha, [n_samples], list_n_features, precompute=False) pl.subplot(212) pl.plot(list_n_features, lasso_results, 'b-', label='Lasso') pl.plot(list_n_features, lars_lasso_results, 'r-', label='LassoLars') pl.title('%d samples, alpha=%s' % (n_samples, alpha)) pl.legend(loc='upper left') pl.xlabel('number of features') pl.ylabel('Time (s)') pl.axis('tight') pl.show() scikit-learn-0.17.0/benchmarks/bench_mnist.py000066400000000000000000000137701261700307500211550ustar00rootroot00000000000000""" ======================= MNIST dataset benchmark ======================= Benchmark on the MNIST dataset. The dataset comprises 70,000 samples and 784 features. Here, we consider the task of predicting 10 classes - digits from 0 to 9 from their raw images. By contrast to the covertype dataset, the feature space is homogenous. Example of output : [..] Classification performance: =========================== Classifier train-time test-time error-rat ------------------------------------------------------------ Nystroem-SVM 105.07s 0.91s 0.0227 ExtraTrees 48.20s 1.22s 0.0288 RandomForest 47.17s 1.21s 0.0304 SampledRBF-SVM 140.45s 0.84s 0.0486 CART 22.84s 0.16s 0.1214 dummy 0.01s 0.02s 0.8973 """ from __future__ import division, print_function # Author: Issam H. 
Laradji # Arnaud Joly # License: BSD 3 clause import os from time import time import argparse import numpy as np from sklearn.datasets import fetch_mldata from sklearn.datasets import get_data_home from sklearn.ensemble import ExtraTreesClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.dummy import DummyClassifier from sklearn.externals.joblib import Memory from sklearn.kernel_approximation import Nystroem from sklearn.kernel_approximation import RBFSampler from sklearn.metrics import zero_one_loss from sklearn.pipeline import make_pipeline from sklearn.svm import LinearSVC from sklearn.tree import DecisionTreeClassifier from sklearn.utils import check_array from sklearn.linear_model import LogisticRegression # Memoize the data extraction and memory map the resulting # train / test splits in readonly mode memory = Memory(os.path.join(get_data_home(), 'mnist_benchmark_data'), mmap_mode='r') @memory.cache def load_data(dtype=np.float32, order='F'): """Load the data, then cache and memmap the train/test split""" ###################################################################### ## Load dataset print("Loading dataset...") data = fetch_mldata('MNIST original') X = check_array(data['data'], dtype=dtype, order=order) y = data["target"] # Normalize features X = X / 255 ## Create train-test split (as [Joachims, 2006]) print("Creating train-test split...") n_train = 60000 X_train = X[:n_train] y_train = y[:n_train] X_test = X[n_train:] y_test = y[n_train:] return X_train, X_test, y_train, y_test ESTIMATORS = { "dummy": DummyClassifier(), 'CART': DecisionTreeClassifier(), 'ExtraTrees': ExtraTreesClassifier(n_estimators=100), 'RandomForest': RandomForestClassifier(n_estimators=100), 'Nystroem-SVM': make_pipeline(Nystroem(gamma=0.015, n_components=1000), LinearSVC(C=100)), 'SampledRBF-SVM': make_pipeline(RBFSampler(gamma=0.015, n_components=1000), LinearSVC(C=100)), 'LinearRegression-SAG': LogisticRegression(solver='sag', tol=1e-1, C=1e4) } if __name__ == "__main__": parser = argparse.ArgumentParser() parser.add_argument('--classifiers', nargs="+", choices=ESTIMATORS, type=str, default=['ExtraTrees', 'Nystroem-SVM'], help="list of classifiers to benchmark.") parser.add_argument('--n-jobs', nargs="?", default=1, type=int, help="Number of concurrently running workers for " "models that support parallelism.") parser.add_argument('--order', nargs="?", default="C", type=str, choices=["F", "C"], help="Allow to choose between fortran and C ordered " "data") parser.add_argument('--random-seed', nargs="?", default=0, type=int, help="Common seed used by random number generator.") args = vars(parser.parse_args()) print(__doc__) X_train, X_test, y_train, y_test = load_data(order=args["order"]) print("") print("Dataset statistics:") print("===================") print("%s %d" % ("number of features:".ljust(25), X_train.shape[1])) print("%s %d" % ("number of classes:".ljust(25), np.unique(y_train).size)) print("%s %s" % ("data type:".ljust(25), X_train.dtype)) print("%s %d (size=%dMB)" % ("number of train samples:".ljust(25), X_train.shape[0], int(X_train.nbytes / 1e6))) print("%s %d (size=%dMB)" % ("number of test samples:".ljust(25), X_test.shape[0], int(X_test.nbytes / 1e6))) print() print("Training Classifiers") print("====================") error, train_time, test_time = {}, {}, {} for name in sorted(args["classifiers"]): print("Training %s ... 
" % name, end="") estimator = ESTIMATORS[name] estimator_params = estimator.get_params() estimator.set_params(**{p: args["random_seed"] for p in estimator_params if p.endswith("random_state")}) if "n_jobs" in estimator_params: estimator.set_params(n_jobs=args["n_jobs"]) time_start = time() estimator.fit(X_train, y_train) train_time[name] = time() - time_start time_start = time() y_pred = estimator.predict(X_test) test_time[name] = time() - time_start error[name] = zero_one_loss(y_test, y_pred) print("done") print() print("Classification performance:") print("===========================") print("{0: <24} {1: >10} {2: >11} {3: >12}" "".format("Classifier ", "train-time", "test-time", "error-rate")) print("-" * 60) for name in sorted(args["classifiers"], key=error.get): print("{0: <23} {1: >10.2f}s {2: >10.2f}s {3: >12.4f}" "".format(name, train_time[name], test_time[name], error[name])) print() scikit-learn-0.17.0/benchmarks/bench_multilabel_metrics.py000077500000000000000000000157421261700307500237070ustar00rootroot00000000000000#!/usr/bin/env python """ A comparison of multilabel target formats and metrics over them """ from __future__ import division from __future__ import print_function from timeit import timeit from functools import partial import itertools import argparse import sys import matplotlib.pyplot as plt import scipy.sparse as sp import numpy as np from sklearn.datasets import make_multilabel_classification from sklearn.metrics import (f1_score, accuracy_score, hamming_loss, jaccard_similarity_score) from sklearn.utils.testing import ignore_warnings METRICS = { 'f1': partial(f1_score, average='micro'), 'f1-by-sample': partial(f1_score, average='samples'), 'accuracy': accuracy_score, 'hamming': hamming_loss, 'jaccard': jaccard_similarity_score, } FORMATS = { 'sequences': lambda y: [list(np.flatnonzero(s)) for s in y], 'dense': lambda y: y, 'csr': lambda y: sp.csr_matrix(y), 'csc': lambda y: sp.csc_matrix(y), } @ignore_warnings def benchmark(metrics=tuple(v for k, v in sorted(METRICS.items())), formats=tuple(v for k, v in sorted(FORMATS.items())), samples=1000, classes=4, density=.2, n_times=5): """Times metric calculations for a number of inputs Parameters ---------- metrics : array-like of callables (1d or 0d) The metric functions to time. formats : array-like of callables (1d or 0d) These may transform a dense indicator matrix into multilabel representation. samples : array-like of ints (1d or 0d) The number of samples to generate as input. classes : array-like of ints (1d or 0d) The number of classes in the input. density : array-like of ints (1d or 0d) The density of positive labels in the input. n_times : int Time calling the metric n_times times. Returns ------- array of floats shaped like (metrics, formats, samples, classes, density) Time in seconds. 
""" metrics = np.atleast_1d(metrics) samples = np.atleast_1d(samples) classes = np.atleast_1d(classes) density = np.atleast_1d(density) formats = np.atleast_1d(formats) out = np.zeros((len(metrics), len(formats), len(samples), len(classes), len(density)), dtype=float) it = itertools.product(samples, classes, density) for i, (s, c, d) in enumerate(it): _, y_true = make_multilabel_classification(n_samples=s, n_features=1, n_classes=c, n_labels=d * c, random_state=42) _, y_pred = make_multilabel_classification(n_samples=s, n_features=1, n_classes=c, n_labels=d * c, random_state=84) for j, f in enumerate(formats): f_true = f(y_true) f_pred = f(y_pred) for k, metric in enumerate(metrics): t = timeit(partial(metric, f_true, f_pred), number=n_times) out[k, j].flat[i] = t return out def _tabulate(results, metrics, formats): """Prints results by metric and format Uses the last ([-1]) value of other fields """ column_width = max(max(len(k) for k in formats) + 1, 8) first_width = max(len(k) for k in metrics) head_fmt = ('{:<{fw}s}' + '{:>{cw}s}' * len(formats)) row_fmt = ('{:<{fw}s}' + '{:>{cw}.3f}' * len(formats)) print(head_fmt.format('Metric', *formats, cw=column_width, fw=first_width)) for metric, row in zip(metrics, results[:, :, -1, -1, -1]): print(row_fmt.format(metric, *row, cw=column_width, fw=first_width)) def _plot(results, metrics, formats, title, x_ticks, x_label, format_markers=('x', '|', 'o', '+'), metric_colors=('c', 'm', 'y', 'k', 'g', 'r', 'b')): """ Plot the results by metric, format and some other variable given by x_label """ fig = plt.figure('scikit-learn multilabel metrics benchmarks') plt.title(title) ax = fig.add_subplot(111) for i, metric in enumerate(metrics): for j, format in enumerate(formats): ax.plot(x_ticks, results[i, j].flat, label='{}, {}'.format(metric, format), marker=format_markers[j], color=metric_colors[i % len(metric_colors)]) ax.set_xlabel(x_label) ax.set_ylabel('Time (s)') ax.legend() plt.show() if __name__ == "__main__": ap = argparse.ArgumentParser() ap.add_argument('metrics', nargs='*', default=sorted(METRICS), help='Specifies metrics to benchmark, defaults to all. 
' 'Choices are: {}'.format(sorted(METRICS)))
    ap.add_argument('--formats', nargs='+', choices=sorted(FORMATS),
                    help='Specifies multilabel formats to benchmark '
                         '(defaults to all).')
    ap.add_argument('--samples', type=int, default=1000,
                    help='The number of samples to generate')
    ap.add_argument('--classes', type=int, default=10,
                    help='The number of classes')
    ap.add_argument('--density', type=float, default=.2,
                    help='The average density of labels per sample')
    ap.add_argument('--plot', choices=['classes', 'density', 'samples'],
                    default=None,
                    help='Plot time with respect to this parameter varying '
                         'up to the specified value')
    ap.add_argument('--n-steps', default=10, type=int,
                    help='Plot this many points for each metric')
    ap.add_argument('--n-times', default=5, type=int,
                    help="Time performance over n_times trials")
    args = ap.parse_args()

    if args.plot is not None:
        max_val = getattr(args, args.plot)
        if args.plot in ('classes', 'samples'):
            min_val = 2
        else:
            min_val = 0
        steps = np.linspace(min_val, max_val, num=args.n_steps + 1)[1:]
        if args.plot in ('classes', 'samples'):
            steps = np.unique(np.round(steps).astype(int))
        setattr(args, args.plot, steps)

    if args.metrics is None:
        args.metrics = sorted(METRICS)
    if args.formats is None:
        args.formats = sorted(FORMATS)

    results = benchmark([METRICS[k] for k in args.metrics],
                        [FORMATS[k] for k in args.formats],
                        args.samples, args.classes, args.density,
                        args.n_times)

    _tabulate(results, args.metrics, args.formats)

    if args.plot is not None:
        print('Displaying plot', file=sys.stderr)
        title = ('Multilabel metrics with %s'
                 % ', '.join('{0}={1}'.format(field, getattr(args, field))
                             for field in ['samples', 'classes', 'density']
                             if args.plot != field))
        _plot(results, args.metrics, args.formats, title, steps, args.plot)

scikit-learn-0.17.0/benchmarks/bench_plot_approximate_neighbors.py

"""
Benchmark for approximate nearest neighbor search using
locality sensitive hashing forest.

There are two types of benchmarks.

First, the accuracy of LSHForest queries is measured for various
hyper-parameters and index sizes.

Second, the speed up of LSHForest queries compared to the brute force
method of exact nearest neighbors is measured for the aforementioned
settings. In general, the speed up increases as the index size grows.
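
There are no command-line options; running the script directly reproduces
both benchmarks and, assuming matplotlib is available, the accompanying
precision and speed-up plots::

    python benchmarks/bench_plot_approximate_neighbors.py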
""" from __future__ import division import numpy as np from tempfile import gettempdir from time import time from sklearn.neighbors import NearestNeighbors from sklearn.neighbors.approximate import LSHForest from sklearn.datasets import make_blobs from sklearn.externals.joblib import Memory m = Memory(cachedir=gettempdir()) @m.cache() def make_data(n_samples, n_features, n_queries, random_state=0): """Create index and query data.""" print('Generating random blob-ish data') X, _ = make_blobs(n_samples=n_samples + n_queries, n_features=n_features, centers=100, shuffle=True, random_state=random_state) # Keep the last samples as held out query vectors: note since we used # shuffle=True we have ensured that index and query vectors are # samples from the same distribution (a mixture of 100 gaussians in this # case) return X[:n_samples], X[n_samples:] def calc_exact_neighbors(X, queries, n_queries, n_neighbors): """Measures average times for exact neighbor queries.""" print ('Building NearestNeighbors for %d samples in %d dimensions' % (X.shape[0], X.shape[1])) nbrs = NearestNeighbors(algorithm='brute', metric='cosine').fit(X) average_time = 0 t0 = time() neighbors = nbrs.kneighbors(queries, n_neighbors=n_neighbors, return_distance=False) average_time = (time() - t0) / n_queries return neighbors, average_time def calc_accuracy(X, queries, n_queries, n_neighbors, exact_neighbors, average_time_exact, **lshf_params): """Calculates accuracy and the speed up of LSHForest.""" print('Building LSHForest for %d samples in %d dimensions' % (X.shape[0], X.shape[1])) lshf = LSHForest(**lshf_params) t0 = time() lshf.fit(X) lshf_build_time = time() - t0 print('Done in %0.3fs' % lshf_build_time) accuracy = 0 t0 = time() approx_neighbors = lshf.kneighbors(queries, n_neighbors=n_neighbors, return_distance=False) average_time_approx = (time() - t0) / n_queries for i in range(len(queries)): accuracy += np.in1d(approx_neighbors[i], exact_neighbors[i]).mean() accuracy /= n_queries speed_up = average_time_exact / average_time_approx print('Average time for lshf neighbor queries: %0.3fs' % average_time_approx) print ('Average time for exact neighbor queries: %0.3fs' % average_time_exact) print ('Average Accuracy : %0.2f' % accuracy) print ('Speed up: %0.1fx' % speed_up) return speed_up, accuracy if __name__ == '__main__': import matplotlib.pyplot as plt # Initialize index sizes n_samples = [int(1e3), int(1e4), int(1e5), int(1e6)] n_features = int(1e2) n_queries = 100 n_neighbors = 10 X_index, X_query = make_data(np.max(n_samples), n_features, n_queries, random_state=0) params_list = [{'n_estimators': 3, 'n_candidates': 50}, {'n_estimators': 5, 'n_candidates': 70}, {'n_estimators': 10, 'n_candidates': 100}] accuracies = np.zeros((len(n_samples), len(params_list)), dtype=float) speed_ups = np.zeros((len(n_samples), len(params_list)), dtype=float) for i, sample_size in enumerate(n_samples): print ('==========================================================') print ('Sample size: %i' % sample_size) print ('------------------------') exact_neighbors, average_time_exact = calc_exact_neighbors( X_index[:sample_size], X_query, n_queries, n_neighbors) for j, params in enumerate(params_list): print ('LSHF parameters: n_estimators = %i, n_candidates = %i' % (params['n_estimators'], params['n_candidates'])) speed_ups[i, j], accuracies[i, j] = calc_accuracy( X_index[:sample_size], X_query, n_queries, n_neighbors, exact_neighbors, average_time_exact, random_state=0, **params) print ('') print 
('==========================================================') # Set labels for LSHForest parameters colors = ['c', 'm', 'y'] legend_rects = [plt.Rectangle((0, 0), 0.1, 0.1, fc=color) for color in colors] legend_labels = ['n_estimators={n_estimators}, ' 'n_candidates={n_candidates}'.format(**p) for p in params_list] # Plot precision plt.figure() plt.legend(legend_rects, legend_labels, loc='upper left') for i in range(len(params_list)): plt.scatter(n_samples, accuracies[:, i], c=colors[i]) plt.plot(n_samples, accuracies[:, i], c=colors[i]) plt.ylim([0, 1.3]) plt.xlim(np.min(n_samples), np.max(n_samples)) plt.semilogx() plt.ylabel("Precision@10") plt.xlabel("Index size") plt.grid(which='both') plt.title("Precision of first 10 neighbors with index size") # Plot speed up plt.figure() plt.legend(legend_rects, legend_labels, loc='upper left') for i in range(len(params_list)): plt.scatter(n_samples, speed_ups[:, i], c=colors[i]) plt.plot(n_samples, speed_ups[:, i], c=colors[i]) plt.ylim(0, np.max(speed_ups)) plt.xlim(np.min(n_samples), np.max(n_samples)) plt.semilogx() plt.ylabel("Speed up") plt.xlabel("Index size") plt.grid(which='both') plt.title("Relationship between Speed up and index size") plt.show() scikit-learn-0.17.0/benchmarks/bench_plot_fastkmeans.py000066400000000000000000000111041261700307500232020ustar00rootroot00000000000000from __future__ import print_function from collections import defaultdict from time import time import numpy as np from numpy import random as nr from sklearn.cluster.k_means_ import KMeans, MiniBatchKMeans def compute_bench(samples_range, features_range): it = 0 results = defaultdict(lambda: []) chunk = 100 max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('==============================') print('Iteration %03d of %03d' % (it, max_it)) print('==============================') print() data = nr.random_integers(-50, 50, (n_samples, n_features)) print('K-Means') tstart = time() kmeans = KMeans(init='k-means++', n_clusters=10).fit(data) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %0.5f" % kmeans.inertia_) print() results['kmeans_speed'].append(delta) results['kmeans_quality'].append(kmeans.inertia_) print('Fast K-Means') # let's prepare the data in small chunks mbkmeans = MiniBatchKMeans(init='k-means++', n_clusters=10, batch_size=chunk) tstart = time() mbkmeans.fit(data) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %f" % mbkmeans.inertia_) print() print() results['MiniBatchKMeans Speed'].append(delta) results['MiniBatchKMeans Quality'].append(mbkmeans.inertia_) return results def compute_bench_2(chunks): results = defaultdict(lambda: []) n_features = 50000 means = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0.5, 0.5], [0.75, -0.5], [-1, 0.75], [1, 0]]) X = np.empty((0, 2)) for i in range(8): X = np.r_[X, means[i] + 0.8 * np.random.randn(n_features, 2)] max_it = len(chunks) it = 0 for chunk in chunks: it += 1 print('==============================') print('Iteration %03d of %03d' % (it, max_it)) print('==============================') print() print('Fast K-Means') tstart = time() mbkmeans = MiniBatchKMeans(init='k-means++', n_clusters=8, batch_size=chunk) mbkmeans.fit(X) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %0.3fs" % mbkmeans.inertia_) print() results['MiniBatchKMeans Speed'].append(delta) results['MiniBatchKMeans Quality'].append(mbkmeans.inertia_) return results if __name__ == '__main__': from 
mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(50, 150, 5).astype(np.int) features_range = np.linspace(150, 50000, 5).astype(np.int) chunks = np.linspace(500, 10000, 15).astype(np.int) results = compute_bench(samples_range, features_range) results_2 = compute_bench_2(chunks) max_time = max([max(i) for i in [t for (label, t) in results.iteritems() if "speed" in label]]) max_inertia = max([max(i) for i in [ t for (label, t) in results.iteritems() if "speed" not in label]]) fig = plt.figure('scikit-learn K-Means benchmark results') for c, (label, timings) in zip('brcy', sorted(results.iteritems())): if 'speed' in label: ax = fig.add_subplot(2, 2, 1, projection='3d') ax.set_zlim3d(0.0, max_time * 1.1) else: ax = fig.add_subplot(2, 2, 2, projection='3d') ax.set_zlim3d(0.0, max_inertia * 1.1) X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) ax.plot_surface(X, Y, Z.T, cstride=1, rstride=1, color=c, alpha=0.5) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') i = 0 for c, (label, timings) in zip('br', sorted(results_2.iteritems())): i += 1 ax = fig.add_subplot(2, 2, i + 2) y = np.asarray(timings) ax.plot(chunks, y, color=c, alpha=0.8) ax.set_xlabel('Chunks') ax.set_ylabel(label) plt.show() scikit-learn-0.17.0/benchmarks/bench_plot_incremental_pca.py000066400000000000000000000144361261700307500242050ustar00rootroot00000000000000""" ======================== IncrementalPCA benchmark ======================== Benchmarks for IncrementalPCA """ import numpy as np import gc from time import time from collections import defaultdict import matplotlib.pyplot as plt from sklearn.datasets import fetch_lfw_people from sklearn.decomposition import IncrementalPCA, RandomizedPCA, PCA def plot_results(X, y, label): plt.plot(X, y, label=label, marker='o') def benchmark(estimator, data): gc.collect() print("Benching %s" % estimator) t0 = time() estimator.fit(data) training_time = time() - t0 data_t = estimator.transform(data) data_r = estimator.inverse_transform(data_t) reconstruction_error = np.mean(np.abs(data - data_r)) return {'time': training_time, 'error': reconstruction_error} def plot_feature_times(all_times, batch_size, all_components, data): plt.figure() plot_results(all_components, all_times['pca'], label="PCA") plot_results(all_components, all_times['ipca'], label="IncrementalPCA, bsize=%i" % batch_size) plot_results(all_components, all_times['rpca'], label="RandomizedPCA") plt.legend(loc="upper left") plt.suptitle("Algorithm runtime vs. n_components\n \ LFW, size %i x %i" % data.shape) plt.xlabel("Number of components (out of max %i)" % data.shape[1]) plt.ylabel("Time (seconds)") def plot_feature_errors(all_errors, batch_size, all_components, data): plt.figure() plot_results(all_components, all_errors['pca'], label="PCA") plot_results(all_components, all_errors['ipca'], label="IncrementalPCA, bsize=%i" % batch_size) plot_results(all_components, all_errors['rpca'], label="RandomizedPCA") plt.legend(loc="lower left") plt.suptitle("Algorithm error vs. 
n_components\n" "LFW, size %i x %i" % data.shape) plt.xlabel("Number of components (out of max %i)" % data.shape[1]) plt.ylabel("Mean absolute error") def plot_batch_times(all_times, n_features, all_batch_sizes, data): plt.figure() plot_results(all_batch_sizes, all_times['pca'], label="PCA") plot_results(all_batch_sizes, all_times['rpca'], label="RandomizedPCA") plot_results(all_batch_sizes, all_times['ipca'], label="IncrementalPCA") plt.legend(loc="lower left") plt.suptitle("Algorithm runtime vs. batch_size for n_components %i\n \ LFW, size %i x %i" % ( n_features, data.shape[0], data.shape[1])) plt.xlabel("Batch size") plt.ylabel("Time (seconds)") def plot_batch_errors(all_errors, n_features, all_batch_sizes, data): plt.figure() plot_results(all_batch_sizes, all_errors['pca'], label="PCA") plot_results(all_batch_sizes, all_errors['ipca'], label="IncrementalPCA") plt.legend(loc="lower left") plt.suptitle("Algorithm error vs. batch_size for n_components %i\n \ LFW, size %i x %i" % ( n_features, data.shape[0], data.shape[1])) plt.xlabel("Batch size") plt.ylabel("Mean absolute error") def fixed_batch_size_comparison(data): all_features = [i.astype(int) for i in np.linspace(data.shape[1] // 10, data.shape[1], num=5)] batch_size = 1000 # Compare runtimes and error for fixed batch size all_times = defaultdict(list) all_errors = defaultdict(list) for n_components in all_features: pca = PCA(n_components=n_components) rpca = RandomizedPCA(n_components=n_components, random_state=1999) ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size) results_dict = {k: benchmark(est, data) for k, est in [('pca', pca), ('ipca', ipca), ('rpca', rpca)]} for k in sorted(results_dict.keys()): all_times[k].append(results_dict[k]['time']) all_errors[k].append(results_dict[k]['error']) plot_feature_times(all_times, batch_size, all_features, data) plot_feature_errors(all_errors, batch_size, all_features, data) def variable_batch_size_comparison(data): batch_sizes = [i.astype(int) for i in np.linspace(data.shape[0] // 10, data.shape[0], num=10)] for n_components in [i.astype(int) for i in np.linspace(data.shape[1] // 10, data.shape[1], num=4)]: all_times = defaultdict(list) all_errors = defaultdict(list) pca = PCA(n_components=n_components) rpca = RandomizedPCA(n_components=n_components, random_state=1999) results_dict = {k: benchmark(est, data) for k, est in [('pca', pca), ('rpca', rpca)]} # Create flat baselines to compare the variation over batch size all_times['pca'].extend([results_dict['pca']['time']] * len(batch_sizes)) all_errors['pca'].extend([results_dict['pca']['error']] * len(batch_sizes)) all_times['rpca'].extend([results_dict['rpca']['time']] * len(batch_sizes)) all_errors['rpca'].extend([results_dict['rpca']['error']] * len(batch_sizes)) for batch_size in batch_sizes: ipca = IncrementalPCA(n_components=n_components, batch_size=batch_size) results_dict = {k: benchmark(est, data) for k, est in [('ipca', ipca)]} all_times['ipca'].append(results_dict['ipca']['time']) all_errors['ipca'].append(results_dict['ipca']['error']) plot_batch_times(all_times, n_components, batch_sizes, data) # RandomizedPCA error is always worse (approx 100x) than other PCA # tests plot_batch_errors(all_errors, n_components, batch_sizes, data) faces = fetch_lfw_people(resize=.2, min_faces_per_person=5) # limit dataset to 5000 people (don't care who they are!) 
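# (A hypothetical alternative for memory-constrained settings -- not used by
#  this benchmark -- would be to feed IncrementalPCA chunk by chunk:
#      ipca = IncrementalPCA(n_components=10)
#      for chunk in np.array_split(X, 10):
#          ipca.partial_fit(chunk)
#  Here fit() is kept so that timings stay comparable across estimators.)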
X = faces.data[:5000] n_samples, h, w = faces.images.shape n_features = X.shape[1] X -= X.mean(axis=0) X /= X.std(axis=0) fixed_batch_size_comparison(X) variable_batch_size_comparison(X) plt.show() scikit-learn-0.17.0/benchmarks/bench_plot_lasso_path.py000066400000000000000000000076431261700307500232200ustar00rootroot00000000000000"""Benchmarks of Lasso regularization path computation using Lars and CD The input data is mostly low rank but is a fat infinite tail. """ from __future__ import print_function from collections import defaultdict import gc import sys from time import time import numpy as np from sklearn.linear_model import lars_path from sklearn.linear_model import lasso_path from sklearn.datasets.samples_generator import make_regression def compute_bench(samples_range, features_range): it = 0 results = defaultdict(lambda: []) max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') dataset_kwargs = { 'n_samples': n_samples, 'n_features': n_features, 'n_informative': n_features / 10, 'effective_rank': min(n_samples, n_features) / 10, #'effective_rank': None, 'bias': 0.0, } print("n_samples: %d" % n_samples) print("n_features: %d" % n_features) X, y = make_regression(**dataset_kwargs) gc.collect() print("benchmarking lars_path (with Gram):", end='') sys.stdout.flush() tstart = time() G = np.dot(X.T, X) # precomputed Gram matrix Xy = np.dot(X.T, y) lars_path(X, y, Xy=Xy, Gram=G, method='lasso') delta = time() - tstart print("%0.3fs" % delta) results['lars_path (with Gram)'].append(delta) gc.collect() print("benchmarking lars_path (without Gram):", end='') sys.stdout.flush() tstart = time() lars_path(X, y, method='lasso') delta = time() - tstart print("%0.3fs" % delta) results['lars_path (without Gram)'].append(delta) gc.collect() print("benchmarking lasso_path (with Gram):", end='') sys.stdout.flush() tstart = time() lasso_path(X, y, precompute=True) delta = time() - tstart print("%0.3fs" % delta) results['lasso_path (with Gram)'].append(delta) gc.collect() print("benchmarking lasso_path (without Gram):", end='') sys.stdout.flush() tstart = time() lasso_path(X, y, precompute=False) delta = time() - tstart print("%0.3fs" % delta) results['lasso_path (without Gram)'].append(delta) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(10, 2000, 5).astype(np.int) features_range = np.linspace(10, 2000, 5).astype(np.int) results = compute_bench(samples_range, features_range) max_time = max(max(t) for t in results.values()) fig = plt.figure('scikit-learn Lasso path benchmark results') i = 1 for c, (label, timings) in zip('bcry', sorted(results.items())): ax = fig.add_subplot(2, 2, i, projection='3d') X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z.T, cstride=1, rstride=1, color=c, alpha=0.8) # dummy point plot to stick the legend to since surface plot do not # support legends (yet?) 
#ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') ax.set_zlabel('Time (s)') ax.set_zlim3d(0.0, max_time * 1.1) ax.set_title(label) #ax.legend() i += 1 plt.show() scikit-learn-0.17.0/benchmarks/bench_plot_neighbors.py000066400000000000000000000144411261700307500230350ustar00rootroot00000000000000""" Plot the scaling of the nearest neighbors algorithms with k, D, and N """ from time import time import numpy as np import pylab as pl from matplotlib import ticker from sklearn import neighbors, datasets def get_data(N, D, dataset='dense'): if dataset == 'dense': np.random.seed(0) return np.random.random((N, D)) elif dataset == 'digits': X = datasets.load_digits().data i = np.argsort(X[0])[::-1] X = X[:, i] return X[:N, :D] else: raise ValueError("invalid dataset: %s" % dataset) def barplot_neighbors(Nrange=2 ** np.arange(1, 11), Drange=2 ** np.arange(7), krange=2 ** np.arange(10), N=1000, D=64, k=5, leaf_size=30, dataset='digits'): algorithms = ('kd_tree', 'brute', 'ball_tree') fiducial_values = {'N': N, 'D': D, 'k': k} #------------------------------------------------------------ # varying N N_results_build = dict([(alg, np.zeros(len(Nrange))) for alg in algorithms]) N_results_query = dict([(alg, np.zeros(len(Nrange))) for alg in algorithms]) for i, NN in enumerate(Nrange): print("N = %i (%i out of %i)" % (NN, i + 1, len(Nrange))) X = get_data(NN, D, dataset) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=min(NN, k), algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() N_results_build[algorithm][i] = (t1 - t0) N_results_query[algorithm][i] = (t2 - t1) #------------------------------------------------------------ # varying D D_results_build = dict([(alg, np.zeros(len(Drange))) for alg in algorithms]) D_results_query = dict([(alg, np.zeros(len(Drange))) for alg in algorithms]) for i, DD in enumerate(Drange): print("D = %i (%i out of %i)" % (DD, i + 1, len(Drange))) X = get_data(N, DD, dataset) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=k, algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() D_results_build[algorithm][i] = (t1 - t0) D_results_query[algorithm][i] = (t2 - t1) #------------------------------------------------------------ # varying k k_results_build = dict([(alg, np.zeros(len(krange))) for alg in algorithms]) k_results_query = dict([(alg, np.zeros(len(krange))) for alg in algorithms]) X = get_data(N, DD, dataset) for i, kk in enumerate(krange): print("k = %i (%i out of %i)" % (kk, i + 1, len(krange))) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=kk, algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() k_results_build[algorithm][i] = (t1 - t0) k_results_query[algorithm][i] = (t2 - t1) pl.figure(figsize=(8, 11)) for (sbplt, vals, quantity, build_time, query_time) in [(311, Nrange, 'N', N_results_build, N_results_query), (312, Drange, 'D', D_results_build, D_results_query), (313, krange, 'k', k_results_build, k_results_query)]: ax = pl.subplot(sbplt, yscale='log') pl.grid(True) tick_vals = [] tick_labels = [] bottom = 10 ** np.min([min(np.floor(np.log10(build_time[alg]))) for alg in algorithms]) for i, alg in enumerate(algorithms): xvals = 0.1 + i * (1 + len(vals)) + np.arange(len(vals)) width = 0.8 c_bar = pl.bar(xvals, build_time[alg] - bottom, width, bottom, color='r') q_bar = 
pl.bar(xvals, query_time[alg], width, build_time[alg], color='b') tick_vals += list(xvals + 0.5 * width) tick_labels += ['%i' % val for val in vals] pl.text((i + 0.02) / len(algorithms), 0.98, alg, transform=ax.transAxes, ha='left', va='top', bbox=dict(facecolor='w', edgecolor='w', alpha=0.5)) pl.ylabel('Time (s)') ax.xaxis.set_major_locator(ticker.FixedLocator(tick_vals)) ax.xaxis.set_major_formatter(ticker.FixedFormatter(tick_labels)) for label in ax.get_xticklabels(): label.set_rotation(-90) label.set_fontsize(10) title_string = 'Varying %s' % quantity descr_string = '' for s in 'NDk': if s == quantity: pass else: descr_string += '%s = %i, ' % (s, fiducial_values[s]) descr_string = descr_string[:-2] pl.text(1.01, 0.5, title_string, transform=ax.transAxes, rotation=-90, ha='left', va='center', fontsize=20) pl.text(0.99, 0.5, descr_string, transform=ax.transAxes, rotation=-90, ha='right', va='center') pl.gcf().suptitle("%s data set" % dataset.capitalize(), fontsize=16) pl.figlegend((c_bar, q_bar), ('construction', 'N-point query'), 'upper right') if __name__ == '__main__': barplot_neighbors(dataset='digits') barplot_neighbors(dataset='dense') pl.show() scikit-learn-0.17.0/benchmarks/bench_plot_nmf.py000066400000000000000000000131561261700307500216370ustar00rootroot00000000000000""" Benchmarks of Non-Negative Matrix Factorization """ from __future__ import print_function from collections import defaultdict import gc from time import time import numpy as np from scipy.linalg import norm from sklearn.decomposition.nmf import NMF, _initialize_nmf from sklearn.datasets.samples_generator import make_low_rank_matrix from sklearn.externals.six.moves import xrange def alt_nnmf(V, r, max_iter=1000, tol=1e-3, init='random'): ''' A, S = nnmf(X, r, tol=1e-3, R=None) Implement Lee & Seung's algorithm Parameters ---------- V : 2-ndarray, [n_samples, n_features] input matrix r : integer number of latent features max_iter : integer, optional maximum number of iterations (default: 1000) tol : double tolerance threshold for early exit (when the update factor is within tol of 1., the function exits) init : string Method used to initialize the procedure. Returns ------- A : 2-ndarray, [n_samples, r] Component part of the factorization S : 2-ndarray, [r, n_features] Data part of the factorization Reference --------- "Algorithms for Non-negative Matrix Factorization" by Daniel D Lee, Sebastian H Seung (available at http://citeseer.ist.psu.edu/lee01algorithms.html) ''' # Nomenclature in the function follows Lee & Seung eps = 1e-5 n, m = V.shape W, H = _initialize_nmf(V, r, init, random_state=0) for i in xrange(max_iter): updateH = np.dot(W.T, V) / (np.dot(np.dot(W.T, W), H) + eps) H *= updateH updateW = np.dot(V, H.T) / (np.dot(W, np.dot(H, H.T)) + eps) W *= updateW if i % 10 == 0: max_update = max(updateW.max(), updateH.max()) if abs(1. 
- max_update) < tol: break return W, H def report(error, time): print("Frobenius loss: %.5f" % error) print("Took: %.2fs" % time) print() def benchmark(samples_range, features_range, rank=50, tolerance=1e-5): timeset = defaultdict(lambda: []) err = defaultdict(lambda: []) for n_samples in samples_range: for n_features in features_range: print("%2d samples, %2d features" % (n_samples, n_features)) print('=======================') X = np.abs(make_low_rank_matrix(n_samples, n_features, effective_rank=rank, tail_strength=0.2)) gc.collect() print("benchmarking nndsvd-nmf: ") tstart = time() m = NMF(n_components=30, tol=tolerance, init='nndsvd').fit(X) tend = time() - tstart timeset['nndsvd-nmf'].append(tend) err['nndsvd-nmf'].append(m.reconstruction_err_) report(m.reconstruction_err_, tend) gc.collect() print("benchmarking nndsvda-nmf: ") tstart = time() m = NMF(n_components=30, init='nndsvda', tol=tolerance).fit(X) tend = time() - tstart timeset['nndsvda-nmf'].append(tend) err['nndsvda-nmf'].append(m.reconstruction_err_) report(m.reconstruction_err_, tend) gc.collect() print("benchmarking nndsvdar-nmf: ") tstart = time() m = NMF(n_components=30, init='nndsvdar', tol=tolerance).fit(X) tend = time() - tstart timeset['nndsvdar-nmf'].append(tend) err['nndsvdar-nmf'].append(m.reconstruction_err_) report(m.reconstruction_err_, tend) gc.collect() print("benchmarking random-nmf") tstart = time() m = NMF(n_components=30, init='random', max_iter=1000, tol=tolerance).fit(X) tend = time() - tstart timeset['random-nmf'].append(tend) err['random-nmf'].append(m.reconstruction_err_) report(m.reconstruction_err_, tend) gc.collect() print("benchmarking alt-random-nmf") tstart = time() W, H = alt_nnmf(X, r=30, init='random', tol=tolerance) tend = time() - tstart timeset['alt-random-nmf'].append(tend) err['alt-random-nmf'].append(np.linalg.norm(X - np.dot(W, H))) report(norm(X - np.dot(W, H)), tend) return timeset, err if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection axes3d import matplotlib.pyplot as plt samples_range = np.linspace(50, 500, 3).astype(np.int) features_range = np.linspace(50, 500, 3).astype(np.int) timeset, err = benchmark(samples_range, features_range) for i, results in enumerate((timeset, err)): fig = plt.figure('scikit-learn Non-Negative Matrix Factorization' 'benchmark results') ax = fig.gca(projection='3d') for c, (label, timings) in zip('rbgcm', sorted(results.iteritems())): X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3, color=c) # dummy point plot to stick the legend to since surface plot do not # support legends (yet?) ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') zlabel = 'Time (s)' if i == 0 else 'reconstruction error' ax.set_zlabel(zlabel) ax.legend() plt.show() scikit-learn-0.17.0/benchmarks/bench_plot_omp_lars.py000066400000000000000000000105371261700307500226730ustar00rootroot00000000000000"""Benchmarks of orthogonal matching pursuit (:ref:`OMP`) versus least angle regression (:ref:`least_angle_regression`) The input data is mostly low rank but is a fat infinite tail. 
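
A rough sketch of one of the two solvers being timed, run on a tiny
sparse-coded signal (the sizes here are made up for illustration and are
far smaller than the benchmarked ones):

    >>> from sklearn.datasets import make_sparse_coded_signal
    >>> from sklearn.linear_model import orthogonal_mp
    >>> y, X, w = make_sparse_coded_signal(
    ...     n_samples=1, n_components=50, n_features=30,
    ...     n_nonzero_coefs=5, random_state=0)
    >>> coef = orthogonal_mp(X, y, n_nonzero_coefs=5)
    >>> coef.shape
    (50,)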
""" from __future__ import print_function import gc import sys from time import time import numpy as np from sklearn.linear_model import lars_path, orthogonal_mp from sklearn.datasets.samples_generator import make_sparse_coded_signal def compute_bench(samples_range, features_range): it = 0 results = dict() lars = np.empty((len(features_range), len(samples_range))) lars_gram = lars.copy() omp = lars.copy() omp_gram = lars.copy() max_it = len(samples_range) * len(features_range) for i_s, n_samples in enumerate(samples_range): for i_f, n_features in enumerate(features_range): it += 1 n_informative = n_features / 10 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') # dataset_kwargs = { # 'n_train_samples': n_samples, # 'n_test_samples': 2, # 'n_features': n_features, # 'n_informative': n_informative, # 'effective_rank': min(n_samples, n_features) / 10, # #'effective_rank': None, # 'bias': 0.0, # } dataset_kwargs = { 'n_samples': 1, 'n_components': n_features, 'n_features': n_samples, 'n_nonzero_coefs': n_informative, 'random_state': 0 } print("n_samples: %d" % n_samples) print("n_features: %d" % n_features) y, X, _ = make_sparse_coded_signal(**dataset_kwargs) X = np.asfortranarray(X) gc.collect() print("benchmarking lars_path (with Gram):", end='') sys.stdout.flush() tstart = time() G = np.dot(X.T, X) # precomputed Gram matrix Xy = np.dot(X.T, y) lars_path(X, y, Xy=Xy, Gram=G, max_iter=n_informative) delta = time() - tstart print("%0.3fs" % delta) lars_gram[i_f, i_s] = delta gc.collect() print("benchmarking lars_path (without Gram):", end='') sys.stdout.flush() tstart = time() lars_path(X, y, Gram=None, max_iter=n_informative) delta = time() - tstart print("%0.3fs" % delta) lars[i_f, i_s] = delta gc.collect() print("benchmarking orthogonal_mp (with Gram):", end='') sys.stdout.flush() tstart = time() orthogonal_mp(X, y, precompute=True, n_nonzero_coefs=n_informative) delta = time() - tstart print("%0.3fs" % delta) omp_gram[i_f, i_s] = delta gc.collect() print("benchmarking orthogonal_mp (without Gram):", end='') sys.stdout.flush() tstart = time() orthogonal_mp(X, y, precompute=False, n_nonzero_coefs=n_informative) delta = time() - tstart print("%0.3fs" % delta) omp[i_f, i_s] = delta results['time(LARS) / time(OMP)\n (w/ Gram)'] = (lars_gram / omp_gram) results['time(LARS) / time(OMP)\n (w/o Gram)'] = (lars / omp) return results if __name__ == '__main__': samples_range = np.linspace(1000, 5000, 5).astype(np.int) features_range = np.linspace(1000, 5000, 5).astype(np.int) results = compute_bench(samples_range, features_range) max_time = max(np.max(t) for t in results.values()) import pylab as pl fig = pl.figure('scikit-learn OMP vs. 
LARS benchmark results') for i, (label, timings) in enumerate(sorted(results.iteritems())): ax = fig.add_subplot(1, 2, i) vmax = max(1 - timings.min(), -1 + timings.max()) pl.matshow(timings, fignum=False, vmin=1 - vmax, vmax=1 + vmax) ax.set_xticklabels([''] + map(str, samples_range)) ax.set_yticklabels([''] + map(str, features_range)) pl.xlabel('n_samples') pl.ylabel('n_features') pl.title(label) pl.subplots_adjust(0.1, 0.08, 0.96, 0.98, 0.4, 0.63) ax = pl.axes([0.1, 0.08, 0.8, 0.06]) pl.colorbar(cax=ax, orientation='horizontal') pl.show() scikit-learn-0.17.0/benchmarks/bench_plot_parallel_pairwise.py000066400000000000000000000023371261700307500245550ustar00rootroot00000000000000# Author: Mathieu Blondel # License: BSD 3 clause import time import pylab as pl from sklearn.utils import check_random_state from sklearn.metrics.pairwise import pairwise_distances from sklearn.metrics.pairwise import pairwise_kernels def plot(func): random_state = check_random_state(0) one_core = [] multi_core = [] sample_sizes = range(1000, 6000, 1000) for n_samples in sample_sizes: X = random_state.rand(n_samples, 300) start = time.time() func(X, n_jobs=1) one_core.append(time.time() - start) start = time.time() func(X, n_jobs=-1) multi_core.append(time.time() - start) pl.figure('scikit-learn parallel %s benchmark results' % func.__name__) pl.plot(sample_sizes, one_core, label="one core") pl.plot(sample_sizes, multi_core, label="multi core") pl.xlabel('n_samples') pl.ylabel('Time (s)') pl.title('Parallel %s' % func.__name__) pl.legend() def euclidean_distances(X, n_jobs): return pairwise_distances(X, metric="euclidean", n_jobs=n_jobs) def rbf_kernels(X, n_jobs): return pairwise_kernels(X, metric="rbf", n_jobs=n_jobs, gamma=0.1) plot(euclidean_distances) plot(rbf_kernels) pl.show() scikit-learn-0.17.0/benchmarks/bench_plot_svd.py000066400000000000000000000055231261700307500216520ustar00rootroot00000000000000"""Benchmarks of Singular Value Decomposition (Exact and Approximate) The data is mostly low rank but is a fat infinite tail. 
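
A minimal sketch of the approximate factorization being benchmarked
(shapes only, on a small random matrix chosen for illustration):

    >>> import numpy as np
    >>> from sklearn.utils.extmath import randomized_svd
    >>> X = np.random.RandomState(0).randn(50, 30)
    >>> U, s, V = randomized_svd(X, n_components=5, n_iter=3,
    ...                          random_state=0)
    >>> U.shape, s.shape, V.shape
    ((50, 5), (5,), (5, 30))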
""" import gc from time import time import numpy as np from collections import defaultdict from scipy.linalg import svd from sklearn.utils.extmath import randomized_svd from sklearn.datasets.samples_generator import make_low_rank_matrix def compute_bench(samples_range, features_range, n_iter=3, rank=50): it = 0 results = defaultdict(lambda: []) max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') X = make_low_rank_matrix(n_samples, n_features, effective_rank=rank, tail_strength=0.2) gc.collect() print("benchmarking scipy svd: ") tstart = time() svd(X, full_matrices=False) results['scipy svd'].append(time() - tstart) gc.collect() print("benchmarking scikit-learn randomized_svd: n_iter=0") tstart = time() randomized_svd(X, rank, n_iter=0) results['scikit-learn randomized_svd (n_iter=0)'].append( time() - tstart) gc.collect() print("benchmarking scikit-learn randomized_svd: n_iter=%d " % n_iter) tstart = time() randomized_svd(X, rank, n_iter=n_iter) results['scikit-learn randomized_svd (n_iter=%d)' % n_iter].append(time() - tstart) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(2, 1000, 4).astype(np.int) features_range = np.linspace(2, 1000, 4).astype(np.int) results = compute_bench(samples_range, features_range) label = 'scikit-learn singular value decomposition benchmark results' fig = plt.figure(label) ax = fig.gca(projection='3d') for c, (label, timings) in zip('rbg', sorted(results.iteritems())): X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3, color=c) # dummy point plot to stick the legend to since surface plot do not # support legends (yet?) 
ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') ax.set_zlabel('Time (s)') ax.legend() plt.show() scikit-learn-0.17.0/benchmarks/bench_plot_ward.py000066400000000000000000000023541261700307500220120ustar00rootroot00000000000000""" Benchmark scikit-learn's Ward implement compared to SciPy's """ import time import numpy as np from scipy.cluster import hierarchy import pylab as pl from sklearn.cluster import AgglomerativeClustering ward = AgglomerativeClustering(n_clusters=3, linkage='ward') n_samples = np.logspace(.5, 3, 9) n_features = np.logspace(1, 3.5, 7) N_samples, N_features = np.meshgrid(n_samples, n_features) scikits_time = np.zeros(N_samples.shape) scipy_time = np.zeros(N_samples.shape) for i, n in enumerate(n_samples): for j, p in enumerate(n_features): X = np.random.normal(size=(n, p)) t0 = time.time() ward.fit(X) scikits_time[j, i] = time.time() - t0 t0 = time.time() hierarchy.ward(X) scipy_time[j, i] = time.time() - t0 ratio = scikits_time / scipy_time pl.figure("scikit-learn Ward's method benchmark results") pl.imshow(np.log(ratio), aspect='auto', origin="lower") pl.colorbar() pl.contour(ratio, levels=[1, ], colors='k') pl.yticks(range(len(n_features)), n_features.astype(np.int)) pl.ylabel('N features') pl.xticks(range(len(n_samples)), n_samples.astype(np.int)) pl.xlabel('N samples') pl.title("Scikit's time, in units of scipy time (log)") pl.show() scikit-learn-0.17.0/benchmarks/bench_random_projections.py000066400000000000000000000213041261700307500237120ustar00rootroot00000000000000""" =========================== Random projection benchmark =========================== Benchmarks for random projections. """ from __future__ import division from __future__ import print_function import gc import sys import optparse from datetime import datetime import collections import numpy as np import scipy.sparse as sp from sklearn import clone from sklearn.externals.six.moves import xrange from sklearn.random_projection import (SparseRandomProjection, GaussianRandomProjection, johnson_lindenstrauss_min_dim) def type_auto_or_float(val): if val == "auto": return "auto" else: return float(val) def type_auto_or_int(val): if val == "auto": return "auto" else: return int(val) def compute_time(t_start, delta): mu_second = 0.0 + 10 ** 6 # number of microseconds in a second return delta.seconds + delta.microseconds / mu_second def bench_scikit_transformer(X, transfomer): gc.collect() clf = clone(transfomer) # start time t_start = datetime.now() clf.fit(X) delta = (datetime.now() - t_start) # stop time time_to_fit = compute_time(t_start, delta) # start time t_start = datetime.now() clf.transform(X) delta = (datetime.now() - t_start) # stop time time_to_transform = compute_time(t_start, delta) return time_to_fit, time_to_transform # Make some random data with uniformly located non zero entries with # Gaussian distributed values def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=None): rng = np.random.RandomState(random_state) data_coo = sp.coo_matrix( (rng.randn(n_nonzeros), (rng.randint(n_samples, size=n_nonzeros), rng.randint(n_features, size=n_nonzeros))), shape=(n_samples, n_features)) return data_coo.toarray(), data_coo.tocsr() def print_row(clf_type, time_fit, time_transform): print("%s | %s | %s" % (clf_type.ljust(30), ("%.4fs" % time_fit).center(12), ("%.4fs" % time_transform).center(12))) if __name__ == "__main__": ########################################################################### # Option parser 
########################################################################### op = optparse.OptionParser() op.add_option("--n-times", dest="n_times", default=5, type=int, help="Benchmark results are average over n_times experiments") op.add_option("--n-features", dest="n_features", default=10 ** 4, type=int, help="Number of features in the benchmarks") op.add_option("--n-components", dest="n_components", default="auto", help="Size of the random subspace." " ('auto' or int > 0)") op.add_option("--ratio-nonzeros", dest="ratio_nonzeros", default=10 ** -3, type=float, help="Number of features in the benchmarks") op.add_option("--n-samples", dest="n_samples", default=500, type=int, help="Number of samples in the benchmarks") op.add_option("--random-seed", dest="random_seed", default=13, type=int, help="Seed used by the random number generators.") op.add_option("--density", dest="density", default=1 / 3, help="Density used by the sparse random projection." " ('auto' or float (0.0, 1.0]") op.add_option("--eps", dest="eps", default=0.5, type=float, help="See the documentation of the underlying transformers.") op.add_option("--transformers", dest="selected_transformers", default='GaussianRandomProjection,SparseRandomProjection', type=str, help="Comma-separated list of transformer to benchmark. " "Default: %default. Available: " "GaussianRandomProjection,SparseRandomProjection") op.add_option("--dense", dest="dense", default=False, action="store_true", help="Set input space as a dense matrix.") (opts, args) = op.parse_args() if len(args) > 0: op.error("this script takes no arguments.") sys.exit(1) opts.n_components = type_auto_or_int(opts.n_components) opts.density = type_auto_or_float(opts.density) selected_transformers = opts.selected_transformers.split(',') ########################################################################### # Generate dataset ########################################################################### n_nonzeros = int(opts.ratio_nonzeros * opts.n_features) print('Dataset statics') print("===========================") print('n_samples \t= %s' % opts.n_samples) print('n_features \t= %s' % opts.n_features) if opts.n_components == "auto": print('n_components \t= %s (auto)' % johnson_lindenstrauss_min_dim(n_samples=opts.n_samples, eps=opts.eps)) else: print('n_components \t= %s' % opts.n_components) print('n_elements \t= %s' % (opts.n_features * opts.n_samples)) print('n_nonzeros \t= %s per feature' % n_nonzeros) print('ratio_nonzeros \t= %s' % opts.ratio_nonzeros) print('') ########################################################################### # Set transformer input ########################################################################### transformers = {} ########################################################################### # Set GaussianRandomProjection input gaussian_matrix_params = { "n_components": opts.n_components, "random_state": opts.random_seed } transformers["GaussianRandomProjection"] = \ GaussianRandomProjection(**gaussian_matrix_params) ########################################################################### # Set SparseRandomProjection input sparse_matrix_params = { "n_components": opts.n_components, "random_state": opts.random_seed, "density": opts.density, "eps": opts.eps, } transformers["SparseRandomProjection"] = \ SparseRandomProjection(**sparse_matrix_params) ########################################################################### # Perform benchmark ########################################################################### time_fit = 
collections.defaultdict(list) time_transform = collections.defaultdict(list) print('Benchmarks') print("===========================") print("Generate dataset benchmarks... ", end="") X_dense, X_sparse = make_sparse_random_data(opts.n_samples, opts.n_features, n_nonzeros, random_state=opts.random_seed) X = X_dense if opts.dense else X_sparse print("done") for name in selected_transformers: print("Perform benchmarks for %s..." % name) for iteration in xrange(opts.n_times): print("\titer %s..." % iteration, end="") time_to_fit, time_to_transform = bench_scikit_transformer(X_dense, transformers[name]) time_fit[name].append(time_to_fit) time_transform[name].append(time_to_transform) print("done") print("") ########################################################################### # Print results ########################################################################### print("Script arguments") print("===========================") arguments = vars(opts) print("%s \t | %s " % ("Arguments".ljust(16), "Value".center(12),)) print(25 * "-" + ("|" + "-" * 14) * 1) for key, value in arguments.items(): print("%s \t | %s " % (str(key).ljust(16), str(value).strip().center(12))) print("") print("Transformer performance:") print("===========================") print("Results are averaged over %s repetition(s)." % opts.n_times) print("") print("%s | %s | %s" % ("Transformer".ljust(30), "fit".center(12), "transform".center(12))) print(31 * "-" + ("|" + "-" * 14) * 2) for name in sorted(selected_transformers): print_row(name, np.mean(time_fit[name]), np.mean(time_transform[name])) print("") print("") scikit-learn-0.17.0/benchmarks/bench_rcv1_logreg_convergence.py000066400000000000000000000160051261700307500246050ustar00rootroot00000000000000# Authors: Tom Dupre la Tour # Olivier Grisel # # License: BSD 3 clause import matplotlib.pyplot as plt import numpy as np import gc import time from sklearn.externals.joblib import Memory from sklearn.linear_model import (LogisticRegression, SGDClassifier) from sklearn.datasets import fetch_rcv1 from sklearn.linear_model.sag import get_auto_step_size from sklearn.linear_model.sag_fast import get_max_squared_sum try: import lightning.classification as lightning_clf except ImportError: lightning_clf = None m = Memory(cachedir='.', verbose=0) # compute logistic loss def get_loss(w, intercept, myX, myy, C): n_samples = myX.shape[0] w = w.ravel() p = np.mean(np.log(1. + np.exp(-myy * (myX.dot(w) + intercept)))) print("%f + %f" % (p, w.dot(w) / 2. / C / n_samples)) p += w.dot(w) / 2. / C / n_samples return p # We use joblib to cache individual fits. Note that we do not pass the dataset # as argument as the hashing would be too slow, so we assume that the dataset # never changes. @m.cache() def bench_one(name, clf_type, clf_params, n_iter): clf = clf_type(**clf_params) try: clf.set_params(max_iter=n_iter, random_state=42) except: clf.set_params(n_iter=n_iter, random_state=42) st = time.time() clf.fit(X, y) end = time.time() try: C = 1.0 / clf.alpha / n_samples except: C = clf.C try: intercept = clf.intercept_ except: intercept = 0. 
train_loss = get_loss(clf.coef_, intercept, X, y, C) train_score = clf.score(X, y) test_score = clf.score(X_test, y_test) duration = end - st return train_loss, train_score, test_score, duration def bench(clfs): for (name, clf, iter_range, train_losses, train_scores, test_scores, durations) in clfs: print("training %s" % name) clf_type = type(clf) clf_params = clf.get_params() for n_iter in iter_range: gc.collect() train_loss, train_score, test_score, duration = bench_one( name, clf_type, clf_params, n_iter) train_losses.append(train_loss) train_scores.append(train_score) test_scores.append(test_score) durations.append(duration) print("classifier: %s" % name) print("train_loss: %.8f" % train_loss) print("train_score: %.8f" % train_score) print("test_score: %.8f" % test_score) print("time for fit: %.8f seconds" % duration) print("") print("") return clfs def plot_train_losses(clfs): plt.figure() for (name, _, _, train_losses, _, _, durations) in clfs: plt.plot(durations, train_losses, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("train loss") def plot_train_scores(clfs): plt.figure() for (name, _, _, _, train_scores, _, durations) in clfs: plt.plot(durations, train_scores, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("train score") plt.ylim((0.92, 0.96)) def plot_test_scores(clfs): plt.figure() for (name, _, _, _, _, test_scores, durations) in clfs: plt.plot(durations, test_scores, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("test score") plt.ylim((0.92, 0.96)) def plot_dloss(clfs): plt.figure() pobj_final = [] for (name, _, _, train_losses, _, _, durations) in clfs: pobj_final.append(train_losses[-1]) indices = np.argsort(pobj_final) pobj_best = pobj_final[indices[0]] for (name, _, _, train_losses, _, _, durations) in clfs: log_pobj = np.log(abs(np.array(train_losses) - pobj_best)) / np.log(10) plt.plot(durations, log_pobj, '-o', label=name) plt.legend(loc=0) plt.xlabel("seconds") plt.ylabel("log(best - train_loss)") rcv1 = fetch_rcv1() X = rcv1.data n_samples, n_features = X.shape # consider the binary classification problem 'CCAT' vs the rest ccat_idx = rcv1.target_names.tolist().index('CCAT') y = rcv1.target.tocsc()[:, ccat_idx].toarray().ravel().astype(np.float64) y[y == 0] = -1 # parameters C = 1. 
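# Note: throughout this script the SGD/lightning regularization strength is
# mapped to LogisticRegression's C via alpha = 1. / (C * n_samples), which
# matches the penalty term w.w / (2 * C * n_samples) in get_loss() above.
# For example (hypothetical numbers), C = 1. with n_samples = 100000 would
# give alpha = 1e-5.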
fit_intercept = True tol = 1.0e-14 # max_iter range sgd_iter_range = list(range(1, 121, 10)) newton_iter_range = list(range(1, 25, 3)) lbfgs_iter_range = list(range(1, 242, 12)) liblinear_iter_range = list(range(1, 37, 3)) liblinear_dual_iter_range = list(range(1, 85, 6)) sag_iter_range = list(range(1, 37, 3)) clfs = [ ("LR-liblinear", LogisticRegression(C=C, tol=tol, solver="liblinear", fit_intercept=fit_intercept, intercept_scaling=1), liblinear_iter_range, [], [], [], []), ("LR-liblinear-dual", LogisticRegression(C=C, tol=tol, dual=True, solver="liblinear", fit_intercept=fit_intercept, intercept_scaling=1), liblinear_dual_iter_range, [], [], [], []), ("LR-SAG", LogisticRegression(C=C, tol=tol, solver="sag", fit_intercept=fit_intercept), sag_iter_range, [], [], [], []), ("LR-newton-cg", LogisticRegression(C=C, tol=tol, solver="newton-cg", fit_intercept=fit_intercept), newton_iter_range, [], [], [], []), ("LR-lbfgs", LogisticRegression(C=C, tol=tol, solver="lbfgs", fit_intercept=fit_intercept), lbfgs_iter_range, [], [], [], []), ("SGD", SGDClassifier(alpha=1.0 / C / n_samples, penalty='l2', loss='log', fit_intercept=fit_intercept, verbose=0), sgd_iter_range, [], [], [], [])] if lightning_clf is not None and not fit_intercept: alpha = 1. / C / n_samples # compute the same step_size than in LR-sag max_squared_sum = get_max_squared_sum(X) step_size = get_auto_step_size(max_squared_sum, alpha, "log", fit_intercept) clfs.append( ("Lightning-SVRG", lightning_clf.SVRGClassifier(alpha=alpha, eta=step_size, tol=tol, loss="log"), sag_iter_range, [], [], [], [])) clfs.append( ("Lightning-SAG", lightning_clf.SAGClassifier(alpha=alpha, eta=step_size, tol=tol, loss="log"), sag_iter_range, [], [], [], [])) # We keep only 200 features, to have a dense dataset, # and compare to lightning SAG, which seems incorrect in the sparse case. X_csc = X.tocsc() nnz_in_each_features = X_csc.indptr[1:] - X_csc.indptr[:-1] X = X_csc[:, np.argsort(nnz_in_each_features)[-200:]] X = X.toarray() print("dataset: %.3f MB" % (X.nbytes / 1e6)) # Split training and testing. Switch train and test subset compared to # LYRL2004 split, to have a larger training dataset. n = 23149 X_test = X[:n, :] y_test = y[:n] X = X[n:, :] y = y[n:] clfs = bench(clfs) plot_train_scores(clfs) plot_test_scores(clfs) plot_train_losses(clfs) plot_dloss(clfs) plt.show() scikit-learn-0.17.0/benchmarks/bench_sample_without_replacement.py000066400000000000000000000175101261700307500254420ustar00rootroot00000000000000""" Benchmarks for sampling without replacement of integer. 
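
A minimal sketch of the function under test (the numbers are
illustrative only):

    >>> import numpy as np
    >>> from sklearn.utils.random import sample_without_replacement
    >>> s = sample_without_replacement(n_population=100, n_samples=10,
    ...                                random_state=0)
    >>> len(np.unique(s))  # every drawn integer is distinct
    10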
""" from __future__ import division from __future__ import print_function import gc import sys import optparse from datetime import datetime import operator import matplotlib.pyplot as plt import numpy as np import random from sklearn.externals.six.moves import xrange from sklearn.utils.random import sample_without_replacement def compute_time(t_start, delta): mu_second = 0.0 + 10 ** 6 # number of microseconds in a second return delta.seconds + delta.microseconds / mu_second def bench_sample(sampling, n_population, n_samples): gc.collect() # start time t_start = datetime.now() sampling(n_population, n_samples) delta = (datetime.now() - t_start) # stop time time = compute_time(t_start, delta) return time if __name__ == "__main__": ########################################################################### # Option parser ########################################################################### op = optparse.OptionParser() op.add_option("--n-times", dest="n_times", default=5, type=int, help="Benchmark results are average over n_times experiments") op.add_option("--n-population", dest="n_population", default=100000, type=int, help="Size of the population to sample from.") op.add_option("--n-step", dest="n_steps", default=5, type=int, help="Number of step interval between 0 and n_population.") default_algorithms = "custom-tracking-selection,custom-auto," \ "custom-reservoir-sampling,custom-pool,"\ "python-core-sample,numpy-permutation" op.add_option("--algorithm", dest="selected_algorithm", default=default_algorithms, type=str, help="Comma-separated list of transformer to benchmark. " "Default: %default. \nAvailable: %default") # op.add_option("--random-seed", # dest="random_seed", default=13, type=int, # help="Seed used by the random number generators.") (opts, args) = op.parse_args() if len(args) > 0: op.error("this script takes no arguments.") sys.exit(1) selected_algorithm = opts.selected_algorithm.split(',') for key in selected_algorithm: if key not in default_algorithms.split(','): raise ValueError("Unknown sampling algorithm \"%s\" not in (%s)." 
% (key, default_algorithms)) ########################################################################### # List sampling algorithm ########################################################################### # We assume that sampling algorithm has the following signature: # sample(n_population, n_sample) # sampling_algorithm = {} ########################################################################### # Set Python core input sampling_algorithm["python-core-sample"] = \ lambda n_population, n_sample: \ random.sample(xrange(n_population), n_sample) ########################################################################### # Set custom automatic method selection sampling_algorithm["custom-auto"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="auto", random_state=random_state) ########################################################################### # Set custom tracking based method sampling_algorithm["custom-tracking-selection"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="tracking_selection", random_state=random_state) ########################################################################### # Set custom reservoir based method sampling_algorithm["custom-reservoir-sampling"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="reservoir_sampling", random_state=random_state) ########################################################################### # Set custom reservoir based method sampling_algorithm["custom-pool"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="pool", random_state=random_state) ########################################################################### # Numpy permutation based sampling_algorithm["numpy-permutation"] = \ lambda n_population, n_sample: \ np.random.permutation(n_population)[:n_sample] ########################################################################### # Remove unspecified algorithm sampling_algorithm = dict((key, value) for key, value in sampling_algorithm.items() if key in selected_algorithm) ########################################################################### # Perform benchmark ########################################################################### time = {} n_samples = np.linspace(start=0, stop=opts.n_population, num=opts.n_steps).astype(np.int) ratio = n_samples / opts.n_population print('Benchmarks') print("===========================") for name in sorted(sampling_algorithm): print("Perform benchmarks for %s..." 
% name, end="") time[name] = np.zeros(shape=(opts.n_steps, opts.n_times)) for step in xrange(opts.n_steps): for it in xrange(opts.n_times): time[name][step, it] = bench_sample(sampling_algorithm[name], opts.n_population, n_samples[step]) print("done") print("Averaging results...", end="") for name in sampling_algorithm: time[name] = np.mean(time[name], axis=1) print("done\n") # Print results ########################################################################### print("Script arguments") print("===========================") arguments = vars(opts) print("%s \t | %s " % ("Arguments".ljust(16), "Value".center(12),)) print(25 * "-" + ("|" + "-" * 14) * 1) for key, value in arguments.items(): print("%s \t | %s " % (str(key).ljust(16), str(value).strip().center(12))) print("") print("Sampling algorithm performance:") print("===============================") print("Results are averaged over %s repetition(s)." % opts.n_times) print("") fig = plt.figure('scikit-learn sample w/o replacement benchmark results') plt.title("n_population = %s, n_times = %s" % (opts.n_population, opts.n_times)) ax = fig.add_subplot(111) for name in sampling_algorithm: ax.plot(ratio, time[name], label=name) ax.set_xlabel('ratio of n_sample / n_population') ax.set_ylabel('Time (s)') ax.legend() # Sort legend labels handles, labels = ax.get_legend_handles_labels() hl = sorted(zip(handles, labels), key=operator.itemgetter(1)) handles2, labels2 = zip(*hl) ax.legend(handles2, labels2, loc=0) plt.show() scikit-learn-0.17.0/benchmarks/bench_sgd_regression.py000066400000000000000000000127011261700307500230310ustar00rootroot00000000000000""" Benchmark for SGD regression Compares SGD regression against coordinate descent and Ridge on synthetic data. """ print(__doc__) # Author: Peter Prettenhofer # License: BSD 3 clause import numpy as np import pylab as pl import gc from time import time from sklearn.linear_model import Ridge, SGDRegressor, ElasticNet from sklearn.metrics import mean_squared_error from sklearn.datasets.samples_generator import make_regression if __name__ == "__main__": list_n_samples = np.linspace(100, 10000, 5).astype(np.int) list_n_features = [10, 100, 1000] n_test = 1000 noise = 0.1 alpha = 0.01 sgd_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) elnet_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) ridge_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) asgd_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) for i, n_train in enumerate(list_n_samples): for j, n_features in enumerate(list_n_features): X, y, coef = make_regression( n_samples=n_train + n_test, n_features=n_features, noise=noise, coef=True) X_train = X[:n_train] y_train = y[:n_train] X_test = X[n_train:] y_test = y[n_train:] print("=======================") print("Round %d %d" % (i, j)) print("n_features:", n_features) print("n_samples:", n_train) # Shuffle data idx = np.arange(n_train) np.random.seed(13) np.random.shuffle(idx) X_train = X_train[idx] y_train = y_train[idx] std = X_train.std(axis=0) mean = X_train.mean(axis=0) X_train = (X_train - mean) / std X_test = (X_test - mean) / std std = y_train.std(axis=0) mean = y_train.mean(axis=0) y_train = (y_train - mean) / std y_test = (y_test - mean) / std gc.collect() print("- benchmarking ElasticNet") clf = ElasticNet(alpha=alpha, l1_ratio=0.5, fit_intercept=False) tstart = time() clf.fit(X_train, y_train) elnet_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) elnet_results[i, j, 1] = time() - tstart 
gc.collect() print("- benchmarking SGD") n_iter = np.ceil(10 ** 4.0 / n_train) clf = SGDRegressor(alpha=alpha / n_train, fit_intercept=False, n_iter=n_iter, learning_rate="invscaling", eta0=.01, power_t=0.25) tstart = time() clf.fit(X_train, y_train) sgd_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) sgd_results[i, j, 1] = time() - tstart gc.collect() print("n_iter", n_iter) print("- benchmarking A-SGD") n_iter = np.ceil(10 ** 4.0 / n_train) clf = SGDRegressor(alpha=alpha / n_train, fit_intercept=False, n_iter=n_iter, learning_rate="invscaling", eta0=.002, power_t=0.05, average=(n_iter * n_train // 2)) tstart = time() clf.fit(X_train, y_train) asgd_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) asgd_results[i, j, 1] = time() - tstart gc.collect() print("- benchmarking RidgeRegression") clf = Ridge(alpha=alpha, fit_intercept=False) tstart = time() clf.fit(X_train, y_train) ridge_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) ridge_results[i, j, 1] = time() - tstart # Plot results i = 0 m = len(list_n_features) pl.figure('scikit-learn SGD regression benchmark results', figsize=(5 * 2, 4 * m)) for j in range(m): pl.subplot(m, 2, i + 1) pl.plot(list_n_samples, np.sqrt(elnet_results[:, j, 0]), label="ElasticNet") pl.plot(list_n_samples, np.sqrt(sgd_results[:, j, 0]), label="SGDRegressor") pl.plot(list_n_samples, np.sqrt(asgd_results[:, j, 0]), label="A-SGDRegressor") pl.plot(list_n_samples, np.sqrt(ridge_results[:, j, 0]), label="Ridge") pl.legend(prop={"size": 10}) pl.xlabel("n_train") pl.ylabel("RMSE") pl.title("Test error - %d features" % list_n_features[j]) i += 1 pl.subplot(m, 2, i + 1) pl.plot(list_n_samples, np.sqrt(elnet_results[:, j, 1]), label="ElasticNet") pl.plot(list_n_samples, np.sqrt(sgd_results[:, j, 1]), label="SGDRegressor") pl.plot(list_n_samples, np.sqrt(asgd_results[:, j, 1]), label="A-SGDRegressor") pl.plot(list_n_samples, np.sqrt(ridge_results[:, j, 1]), label="Ridge") pl.legend(prop={"size": 10}) pl.xlabel("n_train") pl.ylabel("Time [sec]") pl.title("Training time - %d features" % list_n_features[j]) i += 1 pl.subplots_adjust(hspace=.30) pl.show() scikit-learn-0.17.0/benchmarks/bench_sparsify.py000066400000000000000000000064541261700307500216640ustar00rootroot00000000000000""" Benchmark SGD prediction time with dense/sparse coefficients. 
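
The call under test is ``clf.sparsify()``, which converts the fitted
``coef_`` of a linear model to a scipy.sparse matrix so that prediction
can exploit coefficient sparsity. A minimal sketch on tiny random data
(for illustration only):

    >>> import numpy as np
    >>> from scipy.sparse import issparse
    >>> from sklearn.linear_model import SGDRegressor
    >>> rng = np.random.RandomState(0)
    >>> clf = SGDRegressor(penalty='l1').fit(rng.randn(20, 5),
    ...                                      rng.randn(20))
    >>> issparse(clf.sparsify().coef_)
    True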
Invoke with ----------- $ kernprof.py -l sparsity_benchmark.py $ python -m line_profiler sparsity_benchmark.py.lprof Typical output -------------- input data sparsity: 0.050000 true coef sparsity: 0.000100 test data sparsity: 0.027400 model sparsity: 0.000024 r^2 on test data (dense model) : 0.233651 r^2 on test data (sparse model) : 0.233651 Wrote profile results to sparsity_benchmark.py.lprof Timer unit: 1e-06 s File: sparsity_benchmark.py Function: benchmark_dense_predict at line 51 Total time: 0.532979 s Line # Hits Time Per Hit % Time Line Contents ============================================================== 51 @profile 52 def benchmark_dense_predict(): 53 301 640 2.1 0.1 for _ in range(300): 54 300 532339 1774.5 99.9 clf.predict(X_test) File: sparsity_benchmark.py Function: benchmark_sparse_predict at line 56 Total time: 0.39274 s Line # Hits Time Per Hit % Time Line Contents ============================================================== 56 @profile 57 def benchmark_sparse_predict(): 58 1 10854 10854.0 2.8 X_test_sparse = csr_matrix(X_test) 59 301 477 1.6 0.1 for _ in range(300): 60 300 381409 1271.4 97.1 clf.predict(X_test_sparse) """ from scipy.sparse.csr import csr_matrix import numpy as np from sklearn.linear_model.stochastic_gradient import SGDRegressor from sklearn.metrics import r2_score np.random.seed(42) def sparsity_ratio(X): return np.count_nonzero(X) / float(n_samples * n_features) n_samples, n_features = 5000, 300 X = np.random.randn(n_samples, n_features) inds = np.arange(n_samples) np.random.shuffle(inds) X[inds[int(n_features / 1.2):]] = 0 # sparsify input print("input data sparsity: %f" % sparsity_ratio(X)) coef = 3 * np.random.randn(n_features) inds = np.arange(n_features) np.random.shuffle(inds) coef[inds[n_features/2:]] = 0 # sparsify coef print("true coef sparsity: %f" % sparsity_ratio(coef)) y = np.dot(X, coef) # add noise y += 0.01 * np.random.normal((n_samples,)) # Split data in train set and test set n_samples = X.shape[0] X_train, y_train = X[:n_samples / 2], y[:n_samples / 2] X_test, y_test = X[n_samples / 2:], y[n_samples / 2:] print("test data sparsity: %f" % sparsity_ratio(X_test)) ############################################################################### clf = SGDRegressor(penalty='l1', alpha=.2, fit_intercept=True, n_iter=2000) clf.fit(X_train, y_train) print("model sparsity: %f" % sparsity_ratio(clf.coef_)) def benchmark_dense_predict(): for _ in range(300): clf.predict(X_test) def benchmark_sparse_predict(): X_test_sparse = csr_matrix(X_test) for _ in range(300): clf.predict(X_test_sparse) def score(y_test, y_pred, case): r2 = r2_score(y_test, y_pred) print("r^2 on test data (%s) : %f" % (case, r2)) score(y_test, clf.predict(X_test), 'dense model') benchmark_dense_predict() clf.sparsify() score(y_test, clf.predict(X_test), 'sparse model') benchmark_sparse_predict() scikit-learn-0.17.0/benchmarks/bench_tree.py000066400000000000000000000070411261700307500207540ustar00rootroot00000000000000""" To run this, you'll need to have installed. * scikit-learn Does two benchmarks First, we fix a training set, increase the number of samples to classify and plot number of classified samples as a function of time. In the second benchmark, we increase the number of dimensions of the training set, classify a sample and plot the time taken as a function of the number of dimensions. 
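
Both benchmarks simply take the wall-clock time around fit/predict;
a minimal sketch of one such measurement:

    >>> from datetime import datetime
    >>> import numpy as np
    >>> from sklearn.tree import DecisionTreeClassifier
    >>> X = np.random.randn(100, 10)
    >>> Y = np.random.randint(0, 10, (100,))
    >>> tstart = datetime.now()
    >>> _ = DecisionTreeClassifier().fit(X, Y).predict(X)
    >>> delta = datetime.now() - tstart  # elapsed time for fit + predict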
""" import numpy as np import pylab as pl import gc from datetime import datetime # to store the results scikit_classifier_results = [] scikit_regressor_results = [] mu_second = 0.0 + 10 ** 6 # number of microseconds in a second def bench_scikit_tree_classifier(X, Y): """Benchmark with scikit-learn decision tree classifier""" from sklearn.tree import DecisionTreeClassifier gc.collect() # start time tstart = datetime.now() clf = DecisionTreeClassifier() clf.fit(X, Y).predict(X) delta = (datetime.now() - tstart) # stop time scikit_classifier_results.append( delta.seconds + delta.microseconds / mu_second) def bench_scikit_tree_regressor(X, Y): """Benchmark with scikit-learn decision tree regressor""" from sklearn.tree import DecisionTreeRegressor gc.collect() # start time tstart = datetime.now() clf = DecisionTreeRegressor() clf.fit(X, Y).predict(X) delta = (datetime.now() - tstart) # stop time scikit_regressor_results.append( delta.seconds + delta.microseconds / mu_second) if __name__ == '__main__': print('============================================') print('Warning: this is going to take a looong time') print('============================================') n = 10 step = 10000 n_samples = 10000 dim = 10 n_classes = 10 for i in range(n): print('============================================') print('Entering iteration %s of %s' % (i, n)) print('============================================') n_samples += step X = np.random.randn(n_samples, dim) Y = np.random.randint(0, n_classes, (n_samples,)) bench_scikit_tree_classifier(X, Y) Y = np.random.randn(n_samples) bench_scikit_tree_regressor(X, Y) xx = range(0, n * step, step) pl.figure('scikit-learn tree benchmark results') pl.subplot(211) pl.title('Learning with varying number of samples') pl.plot(xx, scikit_classifier_results, 'g-', label='classification') pl.plot(xx, scikit_regressor_results, 'r-', label='regression') pl.legend(loc='upper left') pl.xlabel('number of samples') pl.ylabel('Time (s)') scikit_classifier_results = [] scikit_regressor_results = [] n = 10 step = 500 start_dim = 500 n_classes = 10 dim = start_dim for i in range(0, n): print('============================================') print('Entering iteration %s of %s' % (i, n)) print('============================================') dim += step X = np.random.randn(100, dim) Y = np.random.randint(0, n_classes, (100,)) bench_scikit_tree_classifier(X, Y) Y = np.random.randn(100) bench_scikit_tree_regressor(X, Y) xx = np.arange(start_dim, start_dim + n * step, step) pl.subplot(212) pl.title('Learning in high dimensional spaces') pl.plot(xx, scikit_classifier_results, 'g-', label='classification') pl.plot(xx, scikit_regressor_results, 'r-', label='regression') pl.legend(loc='upper left') pl.xlabel('number of dimensions') pl.ylabel('Time (s)') pl.axis('tight') pl.show() scikit-learn-0.17.0/circle.yml000066400000000000000000000001371261700307500161520ustar00rootroot00000000000000general: # Restric the build to the branch master only branches: only: - master scikit-learn-0.17.0/continuous_integration/000077500000000000000000000000001261700307500207765ustar00rootroot00000000000000scikit-learn-0.17.0/continuous_integration/appveyor/000077500000000000000000000000001261700307500226435ustar00rootroot00000000000000scikit-learn-0.17.0/continuous_integration/appveyor/install.ps1000066400000000000000000000160331261700307500247410ustar00rootroot00000000000000# Sample script to install Python and pip under Windows # Authors: Olivier Grisel, Jonathan Helmus, Kyle Kastner, and Alex Willmer # License: CC0 
1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/ $MINICONDA_URL = "http://repo.continuum.io/miniconda/" $BASE_URL = "https://www.python.org/ftp/python/" $GET_PIP_URL = "https://bootstrap.pypa.io/get-pip.py" $GET_PIP_PATH = "C:\get-pip.py" $PYTHON_PRERELEASE_REGEX = @" (?x) (?\d+) \. (?\d+) \. (?\d+) (?[a-z]{1,2}\d+) "@ function Download ($filename, $url) { $webclient = New-Object System.Net.WebClient $basedir = $pwd.Path + "\" $filepath = $basedir + $filename if (Test-Path $filename) { Write-Host "Reusing" $filepath return $filepath } # Download and retry up to 3 times in case of network transient errors. Write-Host "Downloading" $filename "from" $url $retry_attempts = 2 for ($i = 0; $i -lt $retry_attempts; $i++) { try { $webclient.DownloadFile($url, $filepath) break } Catch [Exception]{ Start-Sleep 1 } } if (Test-Path $filepath) { Write-Host "File saved at" $filepath } else { # Retry once to get the error message if any at the last try $webclient.DownloadFile($url, $filepath) } return $filepath } function ParsePythonVersion ($python_version) { if ($python_version -match $PYTHON_PRERELEASE_REGEX) { return ([int]$matches.major, [int]$matches.minor, [int]$matches.micro, $matches.prerelease) } $version_obj = [version]$python_version return ($version_obj.major, $version_obj.minor, $version_obj.build, "") } function DownloadPython ($python_version, $platform_suffix) { $major, $minor, $micro, $prerelease = ParsePythonVersion $python_version if (($major -le 2 -and $micro -eq 0) ` -or ($major -eq 3 -and $minor -le 2 -and $micro -eq 0) ` ) { $dir = "$major.$minor" $python_version = "$major.$minor$prerelease" } else { $dir = "$major.$minor.$micro" } if ($prerelease) { if (($major -le 2) ` -or ($major -eq 3 -and $minor -eq 1) ` -or ($major -eq 3 -and $minor -eq 2) ` -or ($major -eq 3 -and $minor -eq 3) ` ) { $dir = "$dir/prev" } } if (($major -le 2) -or ($major -le 3 -and $minor -le 4)) { $ext = "msi" if ($platform_suffix) { $platform_suffix = ".$platform_suffix" } } else { $ext = "exe" if ($platform_suffix) { $platform_suffix = "-$platform_suffix" } } $filename = "python-$python_version$platform_suffix.$ext" $url = "$BASE_URL$dir/$filename" $filepath = Download $filename $url return $filepath } function InstallPython ($python_version, $architecture, $python_home) { Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home if (Test-Path $python_home) { Write-Host $python_home "already exists, skipping." 
return $false } if ($architecture -eq "32") { $platform_suffix = "" } else { $platform_suffix = "amd64" } $installer_path = DownloadPython $python_version $platform_suffix $installer_ext = [System.IO.Path]::GetExtension($installer_path) Write-Host "Installing $installer_path to $python_home" $install_log = $python_home + ".log" if ($installer_ext -eq '.msi') { InstallPythonMSI $installer_path $python_home $install_log } else { InstallPythonEXE $installer_path $python_home $install_log } if (Test-Path $python_home) { Write-Host "Python $python_version ($architecture) installation complete" } else { Write-Host "Failed to install Python in $python_home" Get-Content -Path $install_log Exit 1 } } function InstallPythonEXE ($exepath, $python_home, $install_log) { $install_args = "/quiet InstallAllUsers=1 TargetDir=$python_home" RunCommand $exepath $install_args } function InstallPythonMSI ($msipath, $python_home, $install_log) { $install_args = "/qn /log $install_log /i $msipath TARGETDIR=$python_home" $uninstall_args = "/qn /x $msipath" RunCommand "msiexec.exe" $install_args if (-not(Test-Path $python_home)) { Write-Host "Python seems to be installed else-where, reinstalling." RunCommand "msiexec.exe" $uninstall_args RunCommand "msiexec.exe" $install_args } } function RunCommand ($command, $command_args) { Write-Host $command $command_args Start-Process -FilePath $command -ArgumentList $command_args -Wait -Passthru } function InstallPip ($python_home) { $pip_path = $python_home + "\Scripts\pip.exe" $python_path = $python_home + "\python.exe" if (-not(Test-Path $pip_path)) { Write-Host "Installing pip..." $webclient = New-Object System.Net.WebClient $webclient.DownloadFile($GET_PIP_URL, $GET_PIP_PATH) Write-Host "Executing:" $python_path $GET_PIP_PATH & $python_path $GET_PIP_PATH } else { Write-Host "pip already installed." } } function DownloadMiniconda ($python_version, $platform_suffix) { if ($python_version -eq "3.4") { $filename = "Miniconda3-3.5.5-Windows-" + $platform_suffix + ".exe" } else { $filename = "Miniconda-3.5.5-Windows-" + $platform_suffix + ".exe" } $url = $MINICONDA_URL + $filename $filepath = Download $filename $url return $filepath } function InstallMiniconda ($python_version, $architecture, $python_home) { Write-Host "Installing Python" $python_version "for" $architecture "bit architecture to" $python_home if (Test-Path $python_home) { Write-Host $python_home "already exists, skipping." return $false } if ($architecture -eq "32") { $platform_suffix = "x86" } else { $platform_suffix = "x86_64" } $filepath = DownloadMiniconda $python_version $platform_suffix Write-Host "Installing" $filepath "to" $python_home $install_log = $python_home + ".log" $args = "/S /D=$python_home" Write-Host $filepath $args Start-Process -FilePath $filepath -ArgumentList $args -Wait -Passthru if (Test-Path $python_home) { Write-Host "Python $python_version ($architecture) installation complete" } else { Write-Host "Failed to install Python in $python_home" Get-Content -Path $install_log Exit 1 } } function InstallMinicondaPip ($python_home) { $pip_path = $python_home + "\Scripts\pip.exe" $conda_path = $python_home + "\Scripts\conda.exe" if (-not(Test-Path $pip_path)) { Write-Host "Installing pip..." $args = "install --yes pip" Write-Host $conda_path $args Start-Process -FilePath "$conda_path" -ArgumentList $args -Wait -Passthru } else { Write-Host "pip already installed." 
} } function main () { InstallPython $env:PYTHON_VERSION $env:PYTHON_ARCH $env:PYTHON InstallPip $env:PYTHON } main scikit-learn-0.17.0/continuous_integration/appveyor/requirements.txt000066400000000000000000000011541261700307500261300ustar00rootroot00000000000000# Fetch numpy and scipy wheels from the sklearn rackspace wheelhouse. # Those wheels were collected from http://www.lfd.uci.edu/~gohlke/pythonlibs/ # This is a temporary solution. As soon as numpy and scipy provide official # wheel for windows we ca delete this --find-links line. --find-links http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/ # fix the versions of numpy to force the use of numpy and scipy to use the whl # of the rackspace folder instead of trying to install from more recent # source tarball published on PyPI numpy==1.9.3 scipy==0.16.0 nose wheel wheelhouse_uploader scikit-learn-0.17.0/continuous_integration/appveyor/run_with_env.cmd000066400000000000000000000064461261700307500260510ustar00rootroot00000000000000:: To build extensions for 64 bit Python 3, we need to configure environment :: variables to use the MSVC 2010 C++ compilers from GRMSDKX_EN_DVD.iso of: :: MS Windows SDK for Windows 7 and .NET Framework 4 (SDK v7.1) :: :: To build extensions for 64 bit Python 2, we need to configure environment :: variables to use the MSVC 2008 C++ compilers from GRMSDKX_EN_DVD.iso of: :: MS Windows SDK for Windows 7 and .NET Framework 3.5 (SDK v7.0) :: :: 32 bit builds, and 64-bit builds for 3.5 and beyond, do not require specific :: environment configurations. :: :: Note: this script needs to be run with the /E:ON and /V:ON flags for the :: cmd interpreter, at least for (SDK v7.0) :: :: More details at: :: https://github.com/cython/cython/wiki/64BitCythonExtensionsOnWindows :: http://stackoverflow.com/a/13751649/163740 :: :: Author: Olivier Grisel :: License: CC0 1.0 Universal: http://creativecommons.org/publicdomain/zero/1.0/ :: :: Notes about batch files for Python people: :: :: Quotes in values are literally part of the values: :: SET FOO="bar" :: FOO is now five characters long: " b a r " :: If you don't want quotes, don't include them on the right-hand side. :: :: The CALL lines at the end of this file look redundant, but if you move them :: outside of the IF clauses, they do not run properly in the SET_SDK_64==Y :: case, I don't know why. @ECHO OFF SET COMMAND_TO_RUN=%* SET WIN_SDK_ROOT=C:\Program Files\Microsoft SDKs\Windows SET WIN_WDK=c:\Program Files (x86)\Windows Kits\10\Include\wdf :: Extract the major and minor versions, and allow for the minor version to be :: more than 9. This requires the version number to have two dots in it. SET MAJOR_PYTHON_VERSION=%PYTHON_VERSION:~0,1% IF "%PYTHON_VERSION:~3,1%" == "." ( SET MINOR_PYTHON_VERSION=%PYTHON_VERSION:~2,1% ) ELSE ( SET MINOR_PYTHON_VERSION=%PYTHON_VERSION:~2,2% ) :: Based on the Python version, determine what SDK version to use, and whether :: to set the SDK for 64-bit. 
IF %MAJOR_PYTHON_VERSION% == 2 (
    SET WINDOWS_SDK_VERSION="v7.0"
    SET SET_SDK_64=Y
) ELSE (
    IF %MAJOR_PYTHON_VERSION% == 3 (
        SET WINDOWS_SDK_VERSION="v7.1"
        IF %MINOR_PYTHON_VERSION% LEQ 4 (
            SET SET_SDK_64=Y
        ) ELSE (
            SET SET_SDK_64=N
            IF EXIST "%WIN_WDK%" (
                :: See: https://connect.microsoft.com/VisualStudio/feedback/details/1610302/
                REN "%WIN_WDK%" 0wdf
            )
        )
    ) ELSE (
        ECHO Unsupported Python version: "%MAJOR_PYTHON_VERSION%"
        EXIT 1
    )
)

IF %PYTHON_ARCH% == 64 (
    IF %SET_SDK_64% == Y (
        ECHO Configuring Windows SDK %WINDOWS_SDK_VERSION% for Python %MAJOR_PYTHON_VERSION% on a 64 bit architecture
        SET DISTUTILS_USE_SDK=1
        SET MSSdk=1
        "%WIN_SDK_ROOT%\%WINDOWS_SDK_VERSION%\Setup\WindowsSdkVer.exe" -q -version:%WINDOWS_SDK_VERSION%
        "%WIN_SDK_ROOT%\%WINDOWS_SDK_VERSION%\Bin\SetEnv.cmd" /x64 /release
        ECHO Executing: %COMMAND_TO_RUN%
        call %COMMAND_TO_RUN% || EXIT 1
    ) ELSE (
        ECHO Using default MSVC build environment for 64 bit architecture
        ECHO Executing: %COMMAND_TO_RUN%
        call %COMMAND_TO_RUN% || EXIT 1
    )
) ELSE (
    ECHO Using default MSVC build environment for 32 bit architecture
    ECHO Executing: %COMMAND_TO_RUN%
    call %COMMAND_TO_RUN% || EXIT 1
)
scikit-learn-0.17.0/continuous_integration/install.sh000077500000000000000000000041761261700307500230070ustar00rootroot00000000000000#!/bin/bash
# This script is meant to be called by the "install" step defined in
# .travis.yml. See http://docs.travis-ci.com/ for more details.
# The behavior of the script is controlled by environment variables defined
# in the .travis.yml in the top level folder of the project.

# License: 3-clause BSD

set -e

# Fix the compilers to work around the Python 3.4 build unexpectedly
# looking up g++44.
export CC=gcc
export CXX=g++

if [[ "$DISTRIB" == "conda" ]]; then
    # Deactivate the travis-provided virtual environment and setup a
    # conda-based environment instead
    deactivate

    # Use the miniconda installer for faster download / install of conda
    # itself
    wget http://repo.continuum.io/miniconda/Miniconda-3.6.0-Linux-x86_64.sh \
        -O miniconda.sh
    chmod +x miniconda.sh && ./miniconda.sh -b
    export PATH=/home/travis/miniconda/bin:$PATH
    conda update --yes conda

    # Configure the conda environment and put it in the path using the
    # provided versions
    conda create -n testenv --yes python=$PYTHON_VERSION pip nose \
        numpy=$NUMPY_VERSION scipy=$SCIPY_VERSION
    source activate testenv

    if [[ "$INSTALL_MKL" == "true" ]]; then
        # Make sure that MKL is used
        conda install --yes mkl
    else
        # Make sure that MKL is not used
        conda remove --yes --features mkl || echo "MKL not installed"
    fi

elif [[ "$DISTRIB" == "ubuntu" ]]; then
    # At the time of writing numpy 1.9.1 is included in the travis
    # virtualenv but we want to use numpy installed through apt-get
    # install.
    deactivate
    # Create a new virtualenv using system site packages for numpy and scipy
    virtualenv --system-site-packages testvenv
    source testvenv/bin/activate
    pip install nose
fi

if [[ "$COVERAGE" == "true" ]]; then
    pip install coverage coveralls
fi

# Build scikit-learn in the install.sh script to collapse the verbose
# build output in the travis output when it succeeds.
python --version
python -c "import numpy; print('numpy %s' % numpy.__version__)"
python -c "import scipy; print('scipy %s' % scipy.__version__)"
python setup.py build_ext --inplace
scikit-learn-0.17.0/continuous_integration/test_script.sh000066400000000000000000000016761261700307500237060ustar00rootroot00000000000000#!/bin/bash
# This script is meant to be called by the "script" step defined in
# .travis.yml. See http://docs.travis-ci.com/ for more details.
# The behavior of the script is controlled by environment variables defined
# in the .travis.yml in the top level folder of the project.

# License: 3-clause BSD

set -e

python --version
python -c "import numpy; print('numpy %s' % numpy.__version__)"
python -c "import scipy; print('scipy %s' % scipy.__version__)"

# Skip tests that require large downloads over the network to save bandwidth
# usage as travis workers are stateless and therefore traditional local
# disk caching does not work.
export SKLEARN_SKIP_NETWORK_TESTS=1

# Do not use "make test" or "make test-coverage" as they enable verbose mode
# which renders travis output too slow to display in a browser.
if [[ "$COVERAGE" == "true" ]]; then
    nosetests -s --with-coverage sklearn
else
    nosetests -s sklearn
fi
make test-doc test-sphinxext
scikit-learn-0.17.0/continuous_integration/windows/000077500000000000000000000000001261700307500224705ustar00rootroot00000000000000scikit-learn-0.17.0/continuous_integration/windows/windows_testing_downloader.ps1000066400000000000000000000236051261700307500305700ustar00rootroot00000000000000# Author: Kyle Kastner
# License: BSD 3 clause

# This script is a helper to download the base python, numpy, and scipy
# packages from their respective websites.
# To quickly execute the script, run the following Powershell command:
# powershell.exe -ExecutionPolicy unrestricted "iex ((new-object net.webclient).DownloadString('https://raw.githubusercontent.com/scikit-learn/scikit-learn/master/continuous_integration/windows/windows_testing_downloader.ps1'))"

# This is a stopgap solution to make Windows testing easier
# until Windows CI issues are resolved.

# Rackspace's default Windows VMs have several security features enabled by default.
# The DisableInternetExplorerESC function disables a feature which
# prevents any webpage from opening without explicit permission.
# This is a default setting of Windows VMs on Rackspace, and makes it annoying to
# download other packages to test!

# Powershell scripts are also disabled by default. One must run the command:
# set-executionpolicy unrestricted
# from a Powershell terminal with administrator rights to enable scripts.
# To start an administrator Powershell terminal, right click second icon from the left on Windows Server 2012's bottom taskbar.

param (
    [string]$python = "None",
    [string]$nogit = "False"
)

function DisableInternetExplorerESC {
    # Disables InternetExplorerESC to enable easier manual downloads of testing packages.
    # http://stackoverflow.com/questions/9368305/disable-ie-security-on-windows-server-via-powershell
    $AdminKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A7-37EF-4b3f-8CFC-4F3A74704073}"
    $UserKey = "HKLM:\SOFTWARE\Microsoft\Active Setup\Installed Components\{A509B1A8-37EF-4b3f-8CFC-4F3A74704073}"
    Set-ItemProperty -Path $AdminKey -Name "IsInstalled" -Value 0
    Set-ItemProperty -Path $UserKey -Name "IsInstalled" -Value 0
    Stop-Process -Name Explorer
    Write-Host "IE Enhanced Security Configuration (ESC) has been disabled."
-ForegroundColor Green } function DownloadPackages ($package_dict, $append_string) { $webclient = New-Object System.Net.WebClient ForEach ($key in $package_dict.Keys) { $url = $package_dict[$key] $file = $key + $append_string if ($url -match "(\.*exe$)") { $file = $file + ".exe" } elseif ($url -match "(\.*msi$)") { $file = $file + ".msi" } else { $file = $file + ".py" } $basedir = $pwd.Path + "\" $filepath = $basedir + $file Write-Host "Downloading" $file "from" $url # Retry up to 5 times in case of network transient errors. $retry_attempts = 5 for($i=0; $i -lt $retry_attempts; $i++){ try{ $webclient.DownloadFile($url, $filepath) break } Catch [Exception]{ Start-Sleep 1 } } Write-Host "File saved at" $filepath } } function InstallPython($match_string) { $pkg_regex = "python" + $match_string + "*" $pkg = Get-ChildItem -Filter $pkg_regex -Name Invoke-Expression -Command "msiexec /qn /i $pkg" Write-Host "Installing Python" Start-Sleep 25 Write-Host "Python installation complete" } function InstallPip($match_string, $python_version) { $pkg_regex = "get-pip" + $match_string + "*" $py = $python_version -replace "\." $pkg = Get-ChildItem -Filter $pkg_regex -Name $python_path = "C:\Python" + $py + "\python.exe" Invoke-Expression -Command "$python_path $pkg" } function EnsurePip($python_version) { $py = $python_version -replace "\." $python_path = "C:\Python" + $py + "\python.exe" Invoke-Expression -Command "$python_path -m ensurepip" } function GetPythonHome($python_version) { $py = $python_version -replace "\." $pypath = "C:\Python" + $py + "\" return $pypath } function GetPipPath($python_version) { $py = $python_version -replace "\." $pypath = GetPythonHome $python_version if ($py.StartsWith("3")) { $pip = $pypath + "Scripts\pip3.exe" } else { $pip = $pypath + "Scripts\pip.exe" } return $pip } function PipInstall($pkg_name, $python_version, $extra_args) { $pip = GetPipPath $python_version Invoke-Expression -Command "$pip install $pkg_name" } function InstallNose($python_version) { PipInstall "nose" $python_version } function WheelInstall($name, $url, $python_version) { $pip = GetPipPath $python_version $args = "install --use-wheel --no-index" Invoke-Expression -Command "$pip $args $url $name" } function InstallWheel($python_version) { PipInstall "virtualenv" $python_version PipInstall "wheel" $python_version } function InstallNumpy($package_dict, $python_version) { #Don't pass name so we can use URL directly. WheelInstall "" $package_dict["numpy"] $python_version } function InstallScipy($package_dict, $python_version) { #Don't pass name so we can use URL directly. 
WheelInstall "" $package_dict["scipy"] $python_version } function InstallGit { $pkg_regex = "git*" $pkg = Get-ChildItem -Filter $pkg_regex -Name $pkg_cmd = $pwd.ToString() + "\" + $pkg + " /verysilent" Invoke-Expression -Command $pkg_cmd Write-Host "Installing Git" Start-Sleep 20 # Remove the installer - seems to cause weird issues with Git Bash Invoke-Expression -Command "rm git.exe" Write-Host "Git installation complete" } function ReadAndUpdateFromRegistry { # http://stackoverflow.com/questions/14381650/how-to-update-windows-powershell-session-environment-variables-from-registry foreach($level in "Machine","User") { [Environment]::GetEnvironmentVariables($level).GetEnumerator() | % { # For Path variables, append the new values, if they're not already in there if($_.Name -match 'Path$') { $_.Value = ($((Get-Content "Env:$($_.Name)") + ";$($_.Value)") -split ';' | Select -unique) -join ';' } $_ } | Set-Content -Path { "Env:$($_.Name)" } } } function UpdatePaths($python_version) { #This function makes local path updates required in order to install Python and supplementary packages in a single shell. $pypath = GetPythonHome $python_version $env:PATH = $env:PATH + ";" + $pypath $env:PYTHONPATH = $pypath + "DLLs;" + $pypath + "Lib;" + $pypath + "Lib\site-packages" $env:PYTHONHOME = $pypath Write-Host "PYTHONHOME temporarily set to" $env:PYTHONHOME Write-Host "PYTHONPATH temporarily set to" $env:PYTHONPATH Write-Host "PATH temporarily set to" $env:PATH } function Python27URLs { # Function returns a dictionary of packages to download for Python 2.7. $urls = @{ "python" = "https://www.python.org/ftp/python/2.7.7/python-2.7.7.msi" "numpy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/numpy-1.8.1-cp27-none-win32.whl" "scipy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/scipy-0.14.0-cp27-none-win32.whl" "get-pip" = "https://bootstrap.pypa.io/get-pip.py" } return $urls } function Python34URLs { # Function returns a dictionary of packages to download for Python 3.4. $urls = @{ "python" = "https://www.python.org/ftp/python/3.4.1/python-3.4.1.msi" "numpy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/numpy-1.8.1-cp34-none-win32.whl" "scipy" = "http://28daf2247a33ed269873-7b1aad3fab3cc330e1fd9d109892382a.r6.cf2.rackcdn.com/scipy-0.14.0-cp34-none-win32.whl" } return $urls } function GitURLs { # Function returns a dictionary of packages to download for Git $urls = @{ "git" = "https://github.com/msysgit/msysgit/releases/download/Git-1.9.4-preview20140611/Git-1.9.4-preview20140611.exe" } return $urls } function main { $versions = @{ "2.7" = Python27URLs "3.4" = Python34URLs } if ($nogit -eq "False") { Write-Host "Downloading and installing Gitbash" $urls = GitURLs DownloadPackages $urls "" InstallGit ".exe" } if (($python -eq "None")) { Write-Host "Installing all supported python versions" Write-Host "Current versions supported are:" ForEach ($key in $versions.Keys) { Write-Host $key $all_python += @($key) } } elseif(!($versions.ContainsKey($python))) { Write-Host "Python version not recognized!" Write-Host "Pass python version with -python" Write-Host "Current versions supported are:" ForEach ($key in $versions.Keys) { Write-Host $key } return } else { $all_python += @($python) } ForEach ($py in $all_python) { Write-Host "Installing Python" $py DisableInternetExplorerESC $pystring = $py -replace "\." 
$pystring = "_py" + $pystring $package_dict = $versions[$py] # This will download the whl packages as well which is # clunky but makes configuration simpler. DownloadPackages $package_dict $pystring UpdatePaths $py InstallPython $pystring ReadAndUpdateFromRegistry if ($package_dict.ContainsKey("get-pip")) { InstallPip $pystring $py } else { EnsurePip $py } InstallNose $py InstallWheel $py # The installers below here use wheel packages. # Wheels were created from CGohlke's installers with # wheel convert # These are hosted in Rackspace Cloud Files. InstallNumpy $package_dict $py InstallScipy $package_dict $py } return } main scikit-learn-0.17.0/doc/000077500000000000000000000000001261700307500147325ustar00rootroot00000000000000scikit-learn-0.17.0/doc/Makefile000066400000000000000000000070661261700307500164030ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD ?= sphinx-build PAPER = BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml pickle json latex latexpdf changes linkcheck doctest optipng all: html-noplot help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " changes to make an overview of all changed/added/deprecated items" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: -rm -rf $(BUILDDIR)/* -rm -rf auto_examples/ -rm -rf generated/* -rm -rf modules/generated/* html: # These two lines make the build a bit more lengthy, and the # the embedding of images more robust rm -rf $(BUILDDIR)/html/_images #rm -rf _build/doctrees/ $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html/stable @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable" html-noplot: $(SPHINXBUILD) -D plot_gallery=0 -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html/stable @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." make -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." 
linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

download-data:
	python -c "from sklearn.datasets.lfw import check_fetch_lfw; check_fetch_lfw()"

# Optimize PNG files. Needs OptiPNG. Change the -P argument to the number of
# cores you have available, so -P 64 if you have a real computer ;)
optipng:
	find _build auto_examples */generated -name '*.png' -print0 \
	| xargs -0 -n 1 -P 4 optipng -o10

dist: html latexpdf
	cp _build/latex/user_guide.pdf _build/html/stable/_downloads/scikit-learn-docs.pdf
scikit-learn-0.17.0/doc/README000066400000000000000000000045371261700307500156210ustar00rootroot00000000000000Documentation for scikit-learn
------------------------------

This section contains the full manual and web page as displayed at
http://scikit-learn.sf.net. To generate the full web page, including
the example gallery (this might take a while):

    make html

Or, if you'd rather not build the example gallery:

    make html-noplot

That should create all the docs in the directory _build/html

To build the PDF manual, run

    make latexpdf


Upload the generated doc to sourceforge
---------------------------------------

First off, generate HTML documentation::

    make html

This should create a directory _build/html/stable with the documentation in
html format.

Next, make sure you have the PNG optimizer OptiPNG installed. The PNG files
generated by Matplotlib tend to be ~20% too big, and they're costing us
bandwidth. Then issue::

    make optipng

This may take some time. If you have a big machine at your disposal, check the
``Makefile``; it has a hint on how to speed up this target.

Now you can upload the generated HTML documentation using scp or some other
SFTP client.

* Project web Hostname: web.sourceforge.net
* Path: htdocs/
* Username: Combine your SourceForge.net Username with your SourceForge.net
  project UNIX name using a comma ( "," ); see below
* Password: Your SourceForge.net Password

An example session might look like the following for Username "jsmith"
uploading a file for this project, using rsync with the right switch for the
permissions:

    [jsmith@linux ~]$ rsync -rltvz --delete _build/html/stable/ \
    > jsmith,scikit-learn@web.sourceforge.net:htdocs/dev -essh
    Connecting to web.sourceforge.net...
    The authenticity of host 'web.sourceforge.net (216.34.181.57)' can't be established.
    RSA key fingerprint is 68:b3:26:02:a0:07:e4:78:d4:ec:7f:2f:6a:4d:32:c5.
    Are you sure you want to continue connecting (yes/no)? yes
    Warning: Permanently added 'web.sourceforge.net,216.34.181.57' (RSA) to the list of known hosts.
    jsmith,fooproject@web.sourceforge.net's password:
    sending incremental file list
    ...

Development documentation automated build
-----------------------------------------

A Rackspace cloud server named 'docbuilder' is continuously building the
master branch to update the http://scikit-learn.org/dev tree of the website.

The configuration of this server is managed at:

    http://github.com/scikit-learn/sklearn-docbuilder
scikit-learn-0.17.0/doc/about.rst000066400000000000000000000165011261700307500166010ustar00rootroot00000000000000
About us
========

.. include:: ../AUTHORS.rst

..
_citing-scikit-learn: Citing scikit-learn ------------------- If you use scikit-learn in a scientific publication, we would appreciate citations to the following paper: `Scikit-learn: Machine Learning in Python `_, Pedregosa *et al.*, JMLR 12, pp. 2825-2830, 2011. Bibtex entry:: @article{scikit-learn, title={Scikit-learn: Machine Learning in {P}ython}, author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, journal={Journal of Machine Learning Research}, volume={12}, pages={2825--2830}, year={2011} } If you want to cite scikit-learn for its API or design, you may also want to consider the following paper: `API design for machine learning software: experiences from the scikit-learn project `_, Buitinck *et al.*, 2013. Bibtex entry:: @inproceedings{sklearn_api, author = {Lars Buitinck and Gilles Louppe and Mathieu Blondel and Fabian Pedregosa and Andreas Mueller and Olivier Grisel and Vlad Niculae and Peter Prettenhofer and Alexandre Gramfort and Jaques Grobler and Robert Layton and Jake VanderPlas and Arnaud Joly and Brian Holt and Ga{\"{e}}l Varoquaux}, title = {{API} design for machine learning software: experiences from the scikit-learn project}, booktitle = {ECML PKDD Workshop: Languages for Data Mining and Machine Learning}, year = {2013}, pages = {108--122}, } Artwork ------- High quality PNG and SVG logos are available in the `doc/logos/ `_ source directory. .. image:: images/scikit-learn-logo-notext.png :align: center Funding ------- `INRIA `_ actively supports this project. It has provided funding for Fabian Pedregosa (2010-2012), Jaques Grobler (2012-2013) and Olivier Grisel (2013-2015) to work on this project full-time. It also hosts coding sprints and other events. .. image:: images/inria-logo.jpg :width: 200pt :align: center `Paris-Saclay Center for Data Science `_ funded one year for a developer to work on the project full-time (2014-2015). .. image:: images/cds-logo.png :width: 200pt :align: center The following students were sponsored by `Google `_ to work on scikit-learn through the `Google Summer of Code `_ program. - 2007 - David Cournapeau - 2011 - `Vlad Niculae`_ - 2012 - `Vlad Niculae`_, Immanuel Bayer. - 2013 - `Kemal Eren`_, Nicolas Trésegnie - 2014 - Hamzeh Alsalhi, Issam Laradji, Maheshakya Wijewardena, Manoj Kumar. It also provided funding for sprints and events around scikit-learn. If you would like to participate in the next Google Summer of code program, please see `this page `_ The `NeuroDebian `_ project providing `Debian `_ packaging and contributions is supported by `Dr. James V. Haxby `_ (`Dartmouth College `_). The `PSF `_ helped find and manage funding for our 2011 Granada sprint. More information can be found `here `__ `tinyclues `_ funded the 2011 international Granada sprint. Donating to the project ~~~~~~~~~~~~~~~~~~~~~~~ If you are interested in donating to the project or to one of our code-sprints, you can use the *Paypal* button below or the `NumFOCUS Donations Page `_ (if you use the latter, please indicate that you are donating for the scikit-learn project). All donations will be handled by `NumFOCUS `_, a non-profit-organization which is managed by a board of `Scipy community members `_. NumFOCUS's mission is to foster scientific computing software, in particular in Python. 
As a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available while in compliance with tax regulations. The received donations for the scikit-learn project mostly will go towards covering travel-expenses for code sprints, as well as towards the organization budget of the project [#f1]_. .. raw :: html


.. rubric:: Notes .. [#f1] Regarding the organization budget in particular, we might use some of the donated funds to pay for other project expenses such as DNS, hosting or continuous integration services. The 2013' Paris international sprint ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |center-div| |telecom| |tinyclues| |afpy| |FNRS| |end-div| .. |center-div| raw:: html
.. |telecom| image:: http://f.hypotheses.org/wp-content/blogs.dir/331/files/2011/03/Logo-TPT.jpg :width: 120pt :target: http://www.telecom-paristech.fr/ .. |tinyclues| image:: http://www.tinyclues.com/static/img/logo.png :width: 120pt :target: http://www.tinyclues.com/ .. |afpy| image:: http://www.afpy.org/logo.png :width: 120pt :target: http://www.afpy.org .. |SGR| image:: http://www.svi.cnrs-bellevue.fr/wikimedia/images/Logo_svi_inp.png :width: 120pt :target: http://www.svi.cnrs-bellevue.fr .. |FNRS| image:: http://www.fnrs.be/uploaddocs/images/COMMUNIQUER/FRS-FNRS_rose_transp.png :width: 120pt :target: http://www.frs-fnrs.be/ .. figure:: http://sites.uclouvain.be/dysco/pmwiki/uploads/Main/dysco.gif :width: 120pt :target: http://sites.uclouvain.be/dysco/ IAP VII/19 - DYSCO .. |end-div| raw:: html
*For more information on this sprint, see* `here `__ Infrastructure support ---------------------- - We would like to thank `Rackspace `_ for providing us with a free `Rackspace Cloud `_ account to automatically build the documentation and the example gallery from for the development version of scikit-learn using `this tool `_. - We would also like to thank `Shining Panda `_ for free CPU time on their Continuous Integration server. scikit-learn-0.17.0/doc/conf.py000066400000000000000000000203761261700307500162410ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # scikit-learn documentation build configuration file, created by # sphinx-quickstart on Fri Jan 8 09:13:42 2010. # # This file is execfile()d with the current directory set to its containing # dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. from __future__ import print_function import sys import os from sklearn.externals.six import u # If extensions (or modules to document with autodoc) are in another # directory, add these directories to sys.path here. If the directory # is relative to the documentation root, use os.path.abspath to make it # absolute, like shown here. sys.path.insert(0, os.path.abspath('sphinxext')) from github_link import make_linkcode_resolve # -- General configuration --------------------------------------------------- # Try to override the matplotlib configuration as early as possible try: import gen_rst except: pass # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = ['gen_rst', 'sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'sphinx.ext.pngmath', 'numpy_ext.numpydoc', 'sphinx.ext.linkcode', ] autosummary_generate = True autodoc_default_flags = ['members', 'inherited-members'] # Add any paths that contain templates here, relative to this directory. templates_path = ['templates'] # generate autosummary even if no references autosummary_generate = True # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8' # Generate the plots for the gallery plot_gallery = True # The master toctree document. master_doc = 'index' # General information about the project. project = u('scikit-learn') copyright = u('2010 - 2014, scikit-learn developers (BSD License)') # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. import sklearn version = sklearn.__version__ # The full version, including alpha/beta/rc tags. release = sklearn.__version__ # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of documents that shouldn't be included in the build. #unused_docs = [] # List of directories, relative to source directory, that shouldn't be # searched for source files. exclude_trees = ['_build', 'templates', 'includes'] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. 
cross-reference text. add_function_parentheses = False # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # -- Options for HTML output ------------------------------------------------- # The theme to use for HTML and HTML Help pages. Major themes that come with # Sphinx are currently 'default' and 'sphinxdoc'. html_theme = 'scikit-learn' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. html_theme_options = {'oldversion': False, 'collapsiblesidebar': True, 'google_analytics': True, 'surveybanner': False, 'sprintbanner': True} # Add any paths that contain custom themes here, relative to this directory. html_theme_path = ['themes'] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. html_short_title = 'scikit-learn' # The name of an image file (relative to this directory) to place at the top # of the sidebar. html_logo = 'logos/scikit-learn-logo-small.png' # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. html_favicon = 'logos/favicon.ico' # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['images'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. html_domain_indices = False # If false, no index is generated. html_use_index = False # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = '' # Output file base name for HTML help builder. htmlhelp_basename = 'scikit-learndoc' # -- Options for LaTeX output ------------------------------------------------ # The paper size ('letter' or 'a4'). #latex_paper_size = 'letter' # The font size ('10pt', '11pt' or '12pt'). #latex_font_size = '10pt' # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, documentclass # [howto/manual]). 
latex_documents = [('index', 'user_guide.tex', u('scikit-learn user guide'), u('scikit-learn developers'), 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. latex_logo = "logos/scikit-learn-logo.png" # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # Additional stuff for the LaTeX preamble. latex_preamble = r""" \usepackage{amsmath}\usepackage{amsfonts}\usepackage{bm}\usepackage{morefloats} \usepackage{enumitem} \setlistdepth{10} """ # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. latex_domain_indices = False trim_doctests_flags = True def generate_example_rst(app, what, name, obj, options, lines): # generate empty examples files, so that we don't get # inclusion errors if there are no examples for a class / module examples_path = os.path.join(app.srcdir, "modules", "generated", "%s.examples" % name) if not os.path.exists(examples_path): # touch file open(examples_path, 'w').close() def setup(app): # to hide/show the prompt in code examples: app.add_javascript('js/copybutton.js') app.connect('autodoc-process-docstring', generate_example_rst) # The following is used by sphinx.ext.linkcode to provide links to github linkcode_resolve = make_linkcode_resolve('sklearn', u'https://github.com/scikit-learn/' 'scikit-learn/blob/{revision}/' '{package}/{path}#L{lineno}') scikit-learn-0.17.0/doc/data_transforms.rst000066400000000000000000000024071261700307500206560ustar00rootroot00000000000000.. include:: includes/big_toc_css.rst .. _data-transforms: Dataset transformations ----------------------- scikit-learn provides a library of transformers, which may clean (see :ref:`preprocessing`), reduce (see :ref:`data_reduction`), expand (see :ref:`kernel_approximation`) or generate (see :ref:`feature_extraction`) feature representations. Like other estimators, these are represented by classes with ``fit`` method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a ``transform`` method which applies this transformation model to unseen data. ``fit_transform`` may be more convenient and efficient for modelling and transforming the training data simultaneously. Combining such transformers, either in parallel or series is covered in :ref:`combining_estimators`. :ref:`metrics` covers transforming feature spaces into affinity matrices, while :ref:`preprocessing_targets` considers transformations of the target space (e.g. categorical labels) for use in scikit-learn. .. toctree:: modules/pipeline modules/feature_extraction modules/preprocessing modules/unsupervised_reduction modules/random_projection modules/kernel_approximation modules/metrics modules/preprocessing_targets scikit-learn-0.17.0/doc/datasets/000077500000000000000000000000001261700307500165425ustar00rootroot00000000000000scikit-learn-0.17.0/doc/datasets/covtype.rst000066400000000000000000000014071261700307500207670ustar00rootroot00000000000000 .. _covtype: Forest covertypes ================= The samples in this dataset correspond to 30×30m patches of forest in the US, collected for the task of predicting each patch's cover type, i.e. the dominant species of tree. There are seven covertypes, making this a multiclass classification problem. Each sample has 54 features, described on the `dataset's homepage `_. 
Some of the features are boolean indicators, while others are discrete or
continuous measurements.

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset; it
returns a dictionary-like object with the feature matrix in the ``data``
member and the target values in ``target``. The dataset will be downloaded
from the web if necessary.
scikit-learn-0.17.0/doc/datasets/index.rst000066400000000000000000000217701261700307500204100ustar00rootroot00000000000000.. _datasets:

=========================
Dataset loading utilities
=========================

.. currentmodule:: sklearn.datasets

The ``sklearn.datasets`` package embeds some small toy datasets
as introduced in the :ref:`Getting Started ` section.

To evaluate the impact of the scale of the dataset (``n_samples`` and
``n_features``) while controlling the statistical properties of the data
(typically the correlation and informativeness of the features), it is
also possible to generate synthetic data.

This package also features helpers to fetch larger datasets commonly
used by the machine learning community to benchmark algorithms on data
that comes from the 'real world'.

General dataset API
===================

There are three distinct kinds of dataset interfaces for different types
of datasets. The simplest one is the interface for sample images, which
is described below in the :ref:`sample_images` section.

The dataset generation functions and the svmlight loader share a simplistic
interface, returning a tuple ``(X, y)`` consisting of a
``n_samples`` * ``n_features`` numpy array ``X`` and an array of length
``n_samples`` containing the targets ``y``.

The toy datasets as well as the 'real world' datasets and the datasets
fetched from mldata.org have more sophisticated structure. These functions
return a dictionary-like object holding at least two items: an array of
shape ``n_samples`` * ``n_features`` with key ``data`` (except for
20newsgroups) and a numpy array of length ``n_samples``, containing the
target values, with key ``target``.

The datasets also contain a description in ``DESCR`` and some contain
``feature_names`` and ``target_names``. See the dataset descriptions below
for details.

Toy datasets
============

scikit-learn comes with a few small standard datasets that do not require
downloading any file from some external website.

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   load_boston
   load_iris
   load_diabetes
   load_digits
   load_linnerud

These datasets are useful to quickly illustrate the behavior of the
various algorithms implemented in the scikit. They are however often too
small to be representative of real world machine learning tasks.

.. _sample_images:

Sample images
=============

The scikit also embeds a couple of sample JPEG images published under
Creative Commons license by their authors. Those images can be useful to
test algorithms and pipelines on 2D data.

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   load_sample_images
   load_sample_image

.. image:: ../auto_examples/cluster/images/plot_color_quantization_001.png
   :target: ../auto_examples/cluster/plot_color_quantization.html
   :scale: 30
   :align: right

.. warning::

  The default coding of images is based on the ``uint8`` dtype to
  spare memory. Often machine learning algorithms work best if the
  input is converted to a floating point representation first. Also,
  if you plan to use ``pylab.imshow`` don't forget to scale to the range
  0 - 1 as done in the following example.

.. topic:: Examples:

    * :ref:`example_cluster_plot_color_quantization.py`
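For illustration, a minimal sketch of the conversion described in the warning
above (``china.jpg`` is one of the bundled sample images; the shape noted in
the comment is indicative only)::

    import pylab as pl
    from sklearn.datasets import load_sample_image

    china = load_sample_image("china.jpg")  # 3D uint8 array (height, width, RGB)
    china = china.astype(float) / 255       # rescale intensities to the 0 - 1 range
    pl.imshow(china)
    pl.show()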
.. _sample_generators:

Sample generators
=================

In addition, scikit-learn includes various random sample generators that
can be used to build artificial datasets of controlled size and complexity.

Generators for classification and clustering
--------------------------------------------

These generators produce a matrix of features and corresponding discrete
targets.

Single label
~~~~~~~~~~~~

Both :func:`make_blobs` and :func:`make_classification` create multiclass
datasets by allocating each class one or more normally-distributed clusters
of points. :func:`make_blobs` provides greater control regarding the centers
and standard deviations of each cluster, and is used to demonstrate
clustering. :func:`make_classification` specialises in introducing noise by
way of: correlated, redundant and uninformative features; multiple Gaussian
clusters per class; and linear transformations of the feature space.

:func:`make_gaussian_quantiles` divides a single Gaussian cluster into
near-equal-size classes separated by concentric hyperspheres.
:func:`make_hastie_10_2` generates a similar binary, 10-dimensional problem.

.. image:: ../auto_examples/datasets/images/plot_random_dataset_001.png
   :target: ../auto_examples/datasets/plot_random_dataset.html
   :scale: 50
   :align: center

:func:`make_circles` and :func:`make_moons` generate 2d binary
classification datasets that are challenging to certain algorithms
(e.g. centroid-based clustering or linear classification), including
optional Gaussian noise. They are useful for visualisation.
:func:`make_circles` produces Gaussian data with a spherical decision
boundary for binary classification.

Multilabel
~~~~~~~~~~

:func:`make_multilabel_classification` generates random samples with
multiple labels, reflecting a bag of words drawn from a mixture of topics.
The number of topics for each document is drawn from a Poisson
distribution, and the topics themselves are drawn from a fixed random
distribution. Similarly, the number of words is drawn from Poisson, with
words drawn from a multinomial, where each topic defines a probability
distribution over words. Simplifications with respect to true bag-of-words
mixtures include:

* Per-topic word distributions are independently drawn, where in reality
  all would be affected by a sparse base distribution, and would be
  correlated.
* For a document generated from multiple topics, all topics are weighted
  equally in generating its bag of words.
* Documents without labels draw words at random, rather than from a base
  distribution.

.. image:: ../auto_examples/datasets/images/plot_random_multilabel_dataset_001.png
   :target: ../auto_examples/datasets/plot_random_multilabel_dataset.html
   :scale: 50
   :align: center

Biclustering
~~~~~~~~~~~~

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_biclusters
   make_checkerboard

Generators for regression
-------------------------

:func:`make_regression` produces regression targets as an optionally-sparse
random linear combination of random features, with noise. Its informative
features may be uncorrelated, or low rank (few features account for most of
the variance).

Other regression generators generate functions deterministically from
randomized features. :func:`make_sparse_uncorrelated` produces a target as
a linear combination of four features with fixed coefficients.
Others encode explicitly non-linear relations:
:func:`make_friedman1` is related by polynomial and sine transforms;
:func:`make_friedman2` includes feature multiplication and reciprocation;
and :func:`make_friedman3` is similar with an arctan transformation on the
target.
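To make the regression generators above concrete, here is a brief sketch of
:func:`make_regression` (all parameter values are arbitrary illustration, not
defaults)::

    from sklearn.datasets import make_regression

    # 100 samples, 10 features of which only 5 are informative, plus
    # Gaussian noise on the target
    X, y = make_regression(n_samples=100, n_features=10, n_informative=5,
                           noise=0.1, random_state=0)
    print(X.shape)   # (100, 10)
    print(y.shape)   # (100,)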
Generators for manifold learning
--------------------------------

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_s_curve
   make_swiss_roll

Generators for decomposition
----------------------------

.. autosummary::
   :toctree: ../modules/generated/
   :template: function.rst

   make_low_rank_matrix
   make_sparse_coded_signal
   make_spd_matrix
   make_sparse_spd_matrix

.. _libsvm_loader:

Datasets in svmlight / libsvm format
====================================

scikit-learn includes utility functions for loading datasets in the
svmlight / libsvm format. In this format, each line takes the form ``