pax_global_header00006660000000000000000000000064122007567530014521gustar00rootroot0000000000000052 comment=cd6d7d22ed9af090fdf59623f1770eaeaffecbcc scikit-learn-0.14.1/000077500000000000000000000000001220075675300141715ustar00rootroot00000000000000scikit-learn-0.14.1/.gitattributes000066400000000000000000000016041220075675300170650ustar00rootroot00000000000000/sklearn/__check_build/_check_build.c -diff /sklearn/_hmmc.c -diff /sklearn/_isotonic.c -diff /sklearn/cluster/_hierarchical.c -diff /sklearn/cluster/_k_means.c -diff /sklearn/datasets/_svmlight_format.c -diff /sklearn/ensemble/_gradient_boosting.c -diff /sklearn/feature_extraction/_hashing.c -diff /sklearn/linear_model/cd_fast.c -diff /sklearn/linear_model/sgd_fast.c -diff /sklearn/metrics/pairwise_fast.c -diff /sklearn/neighbors/ball_tree.c -diff /sklearn/svm/liblinear.c -diff /sklearn/svm/libsvm.c -diff /sklearn/svm/libsvm_sparse.c -diff /sklearn/tree/_tree.c -diff /sklearn/utils/arraybuilder.c -diff /sklearn/utils/arrayfuncs.c -diff /sklearn/utils/graph_shortest_path.c -diff /sklearn/utils/lgamma.c -diff /sklearn/utils/_logistic_sigmoid.c -diff /sklearn/utils/murmurhash.c -diff /sklearn/utils/seq_dataset.c -diff /sklearn/utils/sparsefuncs.c -diff /sklearn/utils/weight_vector.c -diff scikit-learn-0.14.1/.gitignore000066400000000000000000000011071220075675300161600ustar00rootroot00000000000000*.pyc *.so *.pyd *~ .#* *.lprof *.swp *.swo .DS_Store build sklearn/datasets/__config__.py sklearn/**/*.html dist/ MANIFEST doc/_build/ doc/auto_examples/ doc/modules/generated/ doc/datasets/generated/ *.pdf pip-log.txt scikit_learn.egg-info/ .coverage coverage tags covtype.data.gz 20news-18828/ 20news-18828.tar.gz coverages.zip samples.zip doc/coverages.zip doc/samples.zip coverages samples doc/coverages doc/samples *.prof lfw_preprocessed/ nips2010_pdf/ *.nt.bz2 *.tar.gz *.tgz examples/cluster/joblib reuters/ benchmarks/bench_covertype_data/ *.prefs .pydevproject .idea scikit-learn-0.14.1/.mailmap000066400000000000000000000126731220075675300156230ustar00rootroot00000000000000Alexandre Gramfort Alexandre Gramfort Alexandre Gramfort Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Andreas Mueller Arnaud Joly Arnaud Joly Arnaud Joly Anne-Laure Fouque Ariel Rokem arokem Bertrand Thirion Brian Cheung Brian Cheung Brian Cheung Brian Holt Christian Osendorfer Clay Woolam Denis Engemann Denis Engemann Denis Engemann Denis Engemann Diego Molla DraXus draxus Edouard DUCHESNAY Edouard DUCHESNAY Edouard DUCHESNAY Emmanuelle Gouillart Emmanuelle Gouillart Fabian Pedregosa Fabian Pedregosa Fabian Pedregosa Federico Vaggi Federico Vaggi Gael Varoquaux Gael Varoquaux Gael Varoquaux Gilles Louppe Harikrishnan S Hrishikesh Huilgolkar Immanuel Bayer Jake VanderPlas Jake VanderPlas Jake VanderPlas James Bergstra Jaques Grobler Jan Schlüter Jean Kossaifi Jean Kossaifi Jean Kossaifi Joel Nothman Lars Buitinck Lars Buitinck Lars Buitinck Matthieu Perrot Michael Eickenberg Michael Eickenberg Nelle Varoquaux Nelle Varoquaux Nicolas Pinto Noel Dawe Noel Dawe Olivier Grisel Olivier Grisel Olivier Hervieu Peter Prettenhofer Robert Layton Roman Sinayev Roman Sinayev Satrajit Ghosh Shiqiao Du Shiqiao Du Tim Sheerman-Chase Vincent Dubourg Vincent Dubourg Vincent Michel Vincent Michel Vincent Michel Vincent Michel Vincent Michel Vincent Schut Virgile Fritsch Virgile Fritsch Vlad Niculae Wei Li Wei Li X006 Xinfan Meng Yannick Schwartz Yannick Schwartz scikit-learn-0.14.1/.travis.yml000066400000000000000000000003651220075675300163060ustar00rootroot00000000000000language: python python: - "2.7" virtualenv: system_site_packages: true before_install: - sudo apt-get update -qq - sudo apt-get install -qq python-scipy python-nose install: python setup.py build_ext --inplace script: make test scikit-learn-0.14.1/AUTHORS.rst000066400000000000000000000040471220075675300160550ustar00rootroot00000000000000.. -*- mode: rst -*- This is a community effort, and as such many people have contributed to it over the years. History ------- This project was started in 2007 as a Google Summer of Code project by David Cournapeau. Later that year, Matthieu Brucher started work on this project as part of his thesis. In 2010 Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel of INRIA took leadership of the project and made the first public release, February the 1st 2010. Since then, several releases have appeared following a ~3 month cycle, and a striving international community has been leading the development. People ------ .. hlist:: * David Cournapeau * Jarrod Millman * `Matthieu Brucher `_ * `Fabian Pedregosa `_ * `Gael Varoquaux `_ * `Jake VanderPlas `_ * `Alexandre Gramfort `_ * `Olivier Grisel `_ * Bertrand Thirion * Vincent Michel * Chris Filo Gorgolewski * `Angel Soler Gollonet `_ * `Yaroslav Halchenko `_ * Ron Weiss * `Virgile Fritsch `_ * `Mathieu Blondel `_ * `Peter Prettenhofer `_ * Vincent Dubourg * `Alexandre Passos `_ * `Vlad Niculae `_ * Edouard Duchesnay * Thouis (Ray) Jones * Lars Buitinck * Paolo Losi * Nelle Varoquaux * `Brian Holt `_ * Robert Layton * `Gilles Louppe `_ * `Andreas Müller `_ (release manager) * `Satra Ghosh `_ * `Wei Li `_ * `Arnaud Joly `_ * `Kemal Eren `_ scikit-learn-0.14.1/CONTRIBUTING.md000066400000000000000000000124141220075675300164240ustar00rootroot00000000000000 Contributing code ================= **Note: This document is just to get started, visit [**Contributing page**](http://scikit-learn.org/stable/developers/index.html#coding-guidelines) for the full contributor's guide. Please be sure to read it carefully to make the code review process go as smoothly as possible and maximize the likelihood of your contribution being merged.** How to contribute ----------------- The preferred way to contribute to scikit-learn is to fork the [main repository](http://github.com/scikit-learn/scikit-learn/) on GitHub: 1. Fork the [project repository](http://github.com/scikit-learn/scikit-learn): click on the 'Fork' button near the top of the page. This creates a copy of the code under your account on the GitHub server. 2. Clone this copy to your local disk: $ git clone git@github.com:YourLogin/scikit-learn.git 3. Create a branch to hold your changes: $ git checkout -b my-feature and start making changes. Never work in the ``master`` branch! 4. Work on this copy on your computer using Git to do the version control. When you're done editing, do: $ git add modified_files $ git commit to record your changes in Git, then push them to GitHub with: $ git push -u origin my-feature Finally, go to the web page of the your fork of the scikit-learn repo, and click 'Pull request' to send your changes to the maintainers for review. request. This will send an email to the committers. (If any of the above seems like magic to you, then look up the [Git documentation](http://git-scm.com/documentation) on the web.) It is recommended to check that your contribution complies with the following rules before submitting a pull request: - All public methods should have informative docstrings with sample usage presented as doctests when appropriate. - All other tests pass when everything is rebuilt from scratch. On Unix-like systems, check with (from the toplevel source folder): $ make - When adding additional functionality, provide at least one example script in the ``examples/`` folder. Have a look at other examples for reference. Examples should demonstrate why the new functionality is useful in practice and, if possible, compare it to other methods available in scikit-learn. - At least one paragraph of narrative documentation with links to ```` references in the literature (with PDF links when possible) and the example. The documentation should also include expected time and space complexity of the algorithm and scalability, e.g. "this algorithm can scale to a large number of samples > 100000, but does not scale in dimensionality: n_features is expected to be lower than 100". You can also check for common programming errors with the following tools: - Code with good unittest coverage (at least 80%), check with: $ pip install nose coverage $ nosetests --with-coverage path/to/tests_for_package - No pyflakes warnings, check with: $ pip install pyflakes $ pyflakes path/to/module.py - No PEP8 warnings, check with: $ pip install pep8 $ pep8 path/to/module.py - AutoPEP8 can help you fix some of the easy redundant errors: $ pip install autopep8 $ autopep8 path/to/pep8.py Bonus points for contributions that include a performance analysis with a benchmark script and profiling output (please report on the mailing list or on the GitHub issue). Easy Issues ----------- A great way to start contributing to scikit-learn is to pick an item from the list of [Easy issues](https://github.com/scikit-learn/scikit-learn/issues?labels=Easy) in the issue tracker. Resolving these issues allow you to start contributing to the project without much prior knowledge. Your assistance in this area will be greatly appreciated by the more experienced developers as it helps free up their time to concentrate on other issues. Documentation ------------- We are glad to accept any sort of documentation: function docstrings, reStructuredText documents (like this one), tutorials, etc. reStructuredText documents live in the source code repository under the doc/ directory. You can edit the documentation using any text editor and then generate the HTML output by typing ``make html`` from the doc/ directory. Alternatively, ``make`` can be used to quickly generate the documentation without the example gallery. The resulting HTML files will be placed in _build/html/ and are viewable in a web browser. See the README file in the doc/ directory for more information. For building the documentation, you will need [sphinx](http://sphinx.pocoo.org/) and [matplotlib](http://matplotlib.sourceforge.net/). When you are writing documentation, it is important to keep a good compromise between mathematical and algorithmic details, and give intuition to the reader on what the algorithm does. It is best to always start with a small paragraph with a hand-waving explanation of what the method does to the data and a figure (coming from an example) illustrating it. Further Information ------------------- Visit the [Contributing Code](http://scikit-learn.org/stable/developers/index.html#coding-guidelines) section of the website for more information including conforming to the API spec and profiling contributed code. scikit-learn-0.14.1/COPYING000066400000000000000000000030271220075675300152260ustar00rootroot00000000000000New BSD License Copyright (c) 2007–2013 The scikit-learn developers. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: a. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. b. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. c. Neither the name of the Scikit-learn Developers nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. scikit-learn-0.14.1/MANIFEST.in000066400000000000000000000002651220075675300157320ustar00rootroot00000000000000include *.rst recursive-include doc * recursive-include examples * recursive-include sklearn *.c *.h *.pyx *.pxd recursive-include sklearn/datasets *.csv *.csv.gz *.rst *.jpg *.txt scikit-learn-0.14.1/Makefile000066400000000000000000000025361220075675300156370ustar00rootroot00000000000000# simple makefile to simplify repetetive build env management tasks under posix # caution: testing won't work on windows, see README PYTHON ?= python CYTHON ?= cython NOSETESTS ?= nosetests CTAGS ?= ctags all: clean inplace test clean-pyc: find sklearn -name "*.pyc" | xargs rm -f clean-so: find sklearn -name "*.so" | xargs rm -f find sklearn -name "*.pyd" | xargs rm -f find sklearn -name "__pycache__" | xargs rm -rf clean-build: rm -rf build rm -rf dist clean-ctags: rm -f tags clean: clean-build clean-pyc clean-so clean-ctags in: inplace # just a shortcut inplace: $(PYTHON) setup.py build_ext -i test-code: in $(NOSETESTS) -s -v sklearn test-doc: $(NOSETESTS) -s -v doc/ doc/modules/ doc/datasets/ \ doc/developers doc/tutorial/basic doc/tutorial/statistical_inference test-coverage: rm -rf coverage .coverage $(NOSETESTS) -s -v --with-coverage sklearn test: test-code test-doc trailing-spaces: find sklearn -name "*.py" | xargs perl -pi -e 's/[ \t]*$$//' cython: find sklearn -name "*.pyx" | xargs $(CYTHON) ctags: # make tags for symbol based navigation in emacs and vim # Install with: sudo apt-get install exuberant-ctags $(CTAGS) -R * doc: inplace make -C doc html doc-noplot: inplace make -C doc html-noplot code-analysis: flake8 sklearn | grep -v __init__ | grep -v external pylint -E -i y sklearn/ -d E1103,E0611,E1101 scikit-learn-0.14.1/README.rst000066400000000000000000000050311220075675300156570ustar00rootroot00000000000000.. -*- mode: rst -*- |Travis|_ .. |Travis| image:: https://api.travis-ci.org/scikit-learn/scikit-learn.png?branch=master .. _Travis: https://travis-ci.org/scikit-learn/scikit-learn scikit-learn ============ scikit-learn is a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license. The project was started in 2007 by David Cournapeau as a Google Summer of Code project, and since then many volunteers have contributed. See the AUTHORS.rst file for a complete list of contributors. It is currently maintained by a team of volunteers. **Note** `scikit-learn` was previously referred to as `scikits.learn`. Important links =============== - Official source code repo: https://github.com/scikit-learn/scikit-learn - HTML documentation (stable release): http://scikit-learn.org - HTML documentation (development version): http://scikit-learn.org/dev/ - Download releases: http://sourceforge.net/projects/scikit-learn/files/ - Issue tracker: https://github.com/scikit-learn/scikit-learn/issues - Mailing list: https://lists.sourceforge.net/lists/listinfo/scikit-learn-general - IRC channel: ``#scikit-learn`` at ``irc.freenode.net`` Dependencies ============ scikit-learn is tested to work under Python 2.6+ and Python 3.3+ (using the same codebase thanks to an embedded copy of `six `_). The required dependencies to build the software Numpy >= 1.3, SciPy >= 0.7 and a working C/C++ compiler. For running the examples Matplotlib >= 0.99.1 is required and for running the tests you need nose >= 0.10. This configuration matches the Ubuntu 10.04 LTS release from April 2010. Install ======= This package uses distutils, which is the default way of installing python modules. To install in your home directory, use:: python setup.py install --user To install for all users on Unix/Linux:: python setup.py build sudo python setup.py install Development =========== Code ---- GIT ~~~ You can check the latest sources with the command:: git clone git://github.com/scikit-learn/scikit-learn.git or if you have write privileges:: git clone git@github.com:scikit-learn/scikit-learn.git Testing ------- After installation, you can launch the test suite from outside the source directory (you will need to have nosetests installed):: $ nosetests --exe sklearn See the web page http://scikit-learn.org/stable/install.html#testing for more information. Random number generation can be controlled during testing by setting the ``SKLEARN_SEED`` environment variable. scikit-learn-0.14.1/benchmarks/000077500000000000000000000000001220075675300163065ustar00rootroot00000000000000scikit-learn-0.14.1/benchmarks/bench_covertype.py000066400000000000000000000220561220075675300220440ustar00rootroot00000000000000""" =========================== Covertype dataset benchmark =========================== Benchmark stochastic gradient descent (SGD), Liblinear, and Naive Bayes, CART (decision tree), RandomForest and Extra-Trees on the forest covertype dataset of Blackard, Jock, and Dean [1]. The dataset comprises 581,012 samples. It is low dimensional with 54 features and a sparsity of approx. 23%. Here, we consider the task of predicting class 1 (spruce/fir). The classification performance of SGD is competitive with Liblinear while being two orders of magnitude faster to train:: [..] Classification performance: =========================== Classifier train-time test-time error-rate -------------------------------------------- liblinear 15.9744s 0.0705s 0.2305 GaussianNB 3.0666s 0.3884s 0.4841 SGD 1.0558s 0.1152s 0.2300 CART 79.4296s 0.0523s 0.0469 RandomForest 1190.1620s 0.5881s 0.0243 ExtraTrees 640.3194s 0.6495s 0.0198 The same task has been used in a number of papers including: * `"SVM Optimization: Inverse Dependence on Training Set Size" `_ S. Shalev-Shwartz, N. Srebro - In Proceedings of ICML '08. * `"Pegasos: Primal estimated sub-gradient solver for svm" `_ S. Shalev-Shwartz, Y. Singer, N. Srebro - In Proceedings of ICML '07. * `"Training Linear SVMs in Linear Time" `_ T. Joachims - In SIGKDD '06 [1] http://archive.ics.uci.edu/ml/datasets/Covertype """ from __future__ import division, print_function print(__doc__) # Author: Peter Prettenhofer # License: BSD 3 clause import logging import os import sys from time import time from optparse import OptionParser import numpy as np from sklearn.datasets import fetch_covtype from sklearn.svm import LinearSVC from sklearn.linear_model import SGDClassifier from sklearn.naive_bayes import GaussianNB from sklearn.tree import DecisionTreeClassifier from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier from sklearn import metrics from sklearn.externals.joblib import Memory logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s') logger = logging.getLogger(__name__) op = OptionParser() op.add_option("--classifiers", dest="classifiers", default='liblinear,GaussianNB,SGD,CART', help="comma-separated list of classifiers to benchmark. " "default: %default. available: " "liblinear,GaussianNB,SGD,CART,ExtraTrees,RandomForest") op.add_option("--n-jobs", dest="n_jobs", default=1, type=int, help="Number of concurrently running workers for models that" " support parallelism.") # Each number generator use the same seed to avoid coupling issue between # estimators. op.add_option("--random-seed", dest="random_seed", default=13, type=int, help="Common seed used by random number generator.") op.print_help() (opts, args) = op.parse_args() if len(args) > 0: op.error("this script takes no arguments.") sys.exit(1) # Memoize the data extraction and memory map the resulting # train / test splits in readonly mode bench_folder = os.path.dirname(__file__) original_archive = os.path.join(bench_folder, 'covtype.data.gz') joblib_cache_folder = os.path.join(bench_folder, 'bench_covertype_data') m = Memory(joblib_cache_folder, mmap_mode='r') # Load the data, then cache and memmap the train/test split @m.cache def load_data(dtype=np.float32, order='C'): ###################################################################### ## Load dataset print("Loading dataset...") data = fetch_covtype(download_if_missing=True, shuffle=True, random_state=opts.random_seed) X, y = data['data'], data['target'] X = np.asarray(X, dtype=dtype) if order.lower() == 'f': X = np.asfortranarray(X) # class 1 vs. all others. y[np.where(y != 1)] = -1 ###################################################################### ## Create train-test split (as [Joachims, 2006]) logger.info("Creating train-test split...") n_train = 522911 X_train = X[:n_train] y_train = y[:n_train] X_test = X[n_train:] y_test = y[n_train:] ###################################################################### ## Standardize first 10 features (the numerical ones) mean = X_train.mean(axis=0) std = X_train.std(axis=0) mean[10:] = 0.0 std[10:] = 1.0 X_train = (X_train - mean) / std X_test = (X_test - mean) / std return X_train, X_test, y_train, y_test X_train, X_test, y_train, y_test = load_data() ###################################################################### ## Print dataset statistics print("") print("Dataset statistics:") print("===================") print("%s %d" % ("number of features:".ljust(25), X_train.shape[1])) print("%s %d" % ("number of classes:".ljust(25), np.unique(y_train).shape[0])) print("%s %s" % ("data type:".ljust(25), X_train.dtype)) print("%s %d (pos=%d, neg=%d, size=%dMB)" % ("number of train samples:".ljust(25), X_train.shape[0], np.sum(y_train == 1), np.sum(y_train == -1), int(X_train.nbytes / 1e6))) print("%s %d (pos=%d, neg=%d, size=%dMB)" % ("number of test samples:".ljust(25), X_test.shape[0], np.sum(y_test == 1), np.sum(y_test == -1), int(X_test.nbytes / 1e6))) classifiers = dict() ###################################################################### ## Benchmark classifiers def benchmark(clf): t0 = time() clf.fit(X_train, y_train) train_time = time() - t0 t0 = time() pred = clf.predict(X_test) test_time = time() - t0 err = metrics.zero_one_loss(y_test, pred, normalize=True) return err, train_time, test_time ###################################################################### ## Train Liblinear model liblinear_parameters = { 'loss': 'l2', 'penalty': 'l2', 'C': 1000, 'dual': False, 'tol': 1e-3, "random_state": opts.random_seed, } classifiers['liblinear'] = LinearSVC(**liblinear_parameters) ###################################################################### ## Train GaussianNB model classifiers['GaussianNB'] = GaussianNB() ###################################################################### ## Train SGD model sgd_parameters = { 'alpha': 0.001, 'n_iter': 2, 'n_jobs': opts.n_jobs, "random_state": opts.random_seed, } classifiers['SGD'] = SGDClassifier(**sgd_parameters) ###################################################################### ## Train CART model classifiers['CART'] = DecisionTreeClassifier(min_samples_split=5, max_depth=None, random_state=opts.random_seed) ###################################################################### ## Train RandomForest model rf_parameters = { "n_estimators": 20, "n_jobs": opts.n_jobs, "random_state": opts.random_seed, } classifiers['RandomForest'] = RandomForestClassifier(**rf_parameters) ###################################################################### ## Train Extra-Trees model classifiers['ExtraTrees'] = ExtraTreesClassifier(n_estimators=20, n_jobs=opts.n_jobs, random_state=opts.random_seed) ###################################################################### ## Train GBRT model classifiers['GBRT'] = GradientBoostingClassifier(n_estimators=250, random_state=opts.random_seed) selected_classifiers = opts.classifiers.split(',') for name in selected_classifiers: if name not in classifiers: op.error('classifier %r unknown' % name) sys.exit(1) print() print("Training Classifiers") print("====================") print() err, train_time, test_time = {}, {}, {} for name in sorted(selected_classifiers): print("Training %s ..." % name) err[name], train_time[name], test_time[name] = benchmark(classifiers[name]) ###################################################################### ## Print classification performance print() print("Classification performance:") print("===========================") print() def print_row(clf_type, train_time, test_time, err): print("%s %s %s %s" % (clf_type.ljust(12), ("%.4fs" % train_time).center(10), ("%.4fs" % test_time).center(10), ("%.4f" % err).center(10))) print("%s %s %s %s" % ("Classifier ", "train-time", "test-time", "error-rate")) print("-" * 44) for name in sorted(selected_classifiers, key=lambda name: err[name]): print_row(name, train_time[name], test_time[name], err[name]) print() print() scikit-learn-0.14.1/benchmarks/bench_glm.py000066400000000000000000000027251220075675300206040ustar00rootroot00000000000000""" A comparison of different methods in GLM Data comes from a random square matrix. """ from datetime import datetime import numpy as np from sklearn import linear_model from sklearn.utils.bench import total_seconds if __name__ == '__main__': import pylab as pl n_iter = 40 time_ridge = np.empty(n_iter) time_ols = np.empty(n_iter) time_lasso = np.empty(n_iter) dimensions = 500 * np.arange(1, n_iter + 1) for i in range(n_iter): print('Iteration %s of %s' % (i, n_iter)) n_samples, n_features = 10 * i + 3, 10 * i + 3 X = np.random.randn(n_samples, n_features) Y = np.random.randn(n_samples) start = datetime.now() ridge = linear_model.Ridge(alpha=1.) ridge.fit(X, Y) time_ridge[i] = total_seconds(datetime.now() - start) start = datetime.now() ols = linear_model.LinearRegression() ols.fit(X, Y) time_ols[i] = total_seconds(datetime.now() - start) start = datetime.now() lasso = linear_model.LassoLars() lasso.fit(X, Y) time_lasso[i] = total_seconds(datetime.now() - start) pl.figure('scikit-learn GLM benchmark results') pl.xlabel('Dimensions') pl.ylabel('Time (s)') pl.plot(dimensions, time_ridge, color='r') pl.plot(dimensions, time_ols, color='g') pl.plot(dimensions, time_lasso, color='b') pl.legend(['Ridge', 'OLS', 'LassoLars'], loc='upper left') pl.axis('tight') pl.show() scikit-learn-0.14.1/benchmarks/bench_glmnet.py000066400000000000000000000074101220075675300213070ustar00rootroot00000000000000""" To run this, you'll need to have installed. * glmnet-python * scikit-learn (of course) Does two benchmarks First, we fix a training set and increase the number of samples. Then we plot the computation time as function of the number of samples. In the second benchmark, we increase the number of dimensions of the training set. Then we plot the computation time as function of the number of dimensions. In both cases, only 10% of the features are informative. """ import numpy as np import gc from time import time from sklearn.datasets.samples_generator import make_regression alpha = 0.1 # alpha = 0.01 def rmse(a, b): return np.sqrt(np.mean((a - b) ** 2)) def bench(factory, X, Y, X_test, Y_test, ref_coef): gc.collect() # start time tstart = time() clf = factory(alpha=alpha).fit(X, Y) delta = (time() - tstart) # stop time print("duration: %0.3fs" % delta) print("rmse: %f" % rmse(Y_test, clf.predict(X_test))) print("mean coef abs diff: %f" % abs(ref_coef - clf.coef_.ravel()).mean()) return delta if __name__ == '__main__': from glmnet.elastic_net import Lasso as GlmnetLasso from sklearn.linear_model import Lasso as ScikitLasso # Delayed import of pylab import pylab as pl scikit_results = [] glmnet_results = [] n = 20 step = 500 n_features = 1000 n_informative = n_features / 10 n_test_samples = 1000 for i in range(1, n + 1): print('==================') print('Iteration %s of %s' % (i, n)) print('==================') X, Y, coef_ = make_regression( n_samples=(i * step) + n_test_samples, n_features=n_features, noise=0.1, n_informative=n_informative, coef=True) X_test = X[-n_test_samples:] Y_test = Y[-n_test_samples:] X = X[:(i * step)] Y = Y[:(i * step)] print("benchmarking scikit-learn: ") scikit_results.append(bench(ScikitLasso, X, Y, X_test, Y_test, coef_)) print("benchmarking glmnet: ") glmnet_results.append(bench(GlmnetLasso, X, Y, X_test, Y_test, coef_)) pl.clf() xx = range(0, n * step, step) pl.title('Lasso regression on sample dataset (%d features)' % n_features) pl.plot(xx, scikit_results, 'b-', label='scikit-learn') pl.plot(xx, glmnet_results, 'r-', label='glmnet') pl.legend() pl.xlabel('number of samples to classify') pl.ylabel('Time (s)') pl.show() # now do a benchmark where the number of points is fixed # and the variable is the number of features scikit_results = [] glmnet_results = [] n = 20 step = 100 n_samples = 500 for i in range(1, n + 1): print('==================') print('Iteration %02d of %02d' % (i, n)) print('==================') n_features = i * step n_informative = n_features / 10 X, Y, coef_ = make_regression( n_samples=(i * step) + n_test_samples, n_features=n_features, noise=0.1, n_informative=n_informative, coef=True) X_test = X[-n_test_samples:] Y_test = Y[-n_test_samples:] X = X[:n_samples] Y = Y[:n_samples] print("benchmarking scikit-learn: ") scikit_results.append(bench(ScikitLasso, X, Y, X_test, Y_test, coef_)) print("benchmarking glmnet: ") glmnet_results.append(bench(GlmnetLasso, X, Y, X_test, Y_test, coef_)) xx = np.arange(100, 100 + n * step, step) pl.figure('scikit-learn vs. glmnet benchmark results') pl.title('Regression in high dimensional spaces (%d samples)' % n_samples) pl.plot(xx, scikit_results, 'b-', label='scikit-learn') pl.plot(xx, glmnet_results, 'r-', label='glmnet') pl.legend() pl.xlabel('number of features') pl.ylabel('Time (s)') pl.axis('tight') pl.show() scikit-learn-0.14.1/benchmarks/bench_lasso.py000066400000000000000000000063511220075675300211450ustar00rootroot00000000000000""" Benchmarks of Lasso vs LassoLars First, we fix a training set and increase the number of samples. Then we plot the computation time as function of the number of samples. In the second benchmark, we increase the number of dimensions of the training set. Then we plot the computation time as function of the number of dimensions. In both cases, only 10% of the features are informative. """ import gc from time import time import numpy as np from sklearn.datasets.samples_generator import make_regression def compute_bench(alpha, n_samples, n_features, precompute): lasso_results = [] lars_lasso_results = [] it = 0 for ns in n_samples: for nf in n_features: it += 1 print('==================') print('Iteration %s of %s' % (it, max(len(n_samples), len(n_features)))) print('==================') n_informative = nf // 10 X, Y, coef_ = make_regression(n_samples=ns, n_features=nf, n_informative=n_informative, noise=0.1, coef=True) X /= np.sqrt(np.sum(X ** 2, axis=0)) # Normalize data gc.collect() print("- benchmarking Lasso") clf = Lasso(alpha=alpha, fit_intercept=False, precompute=precompute) tstart = time() clf.fit(X, Y) lasso_results.append(time() - tstart) gc.collect() print("- benchmarking LassoLars") clf = LassoLars(alpha=alpha, fit_intercept=False, normalize=False, precompute=precompute) tstart = time() clf.fit(X, Y) lars_lasso_results.append(time() - tstart) return lasso_results, lars_lasso_results if __name__ == '__main__': from sklearn.linear_model import Lasso, LassoLars import pylab as pl alpha = 0.01 # regularization parameter n_features = 10 list_n_samples = np.linspace(100, 1000000, 5).astype(np.int) lasso_results, lars_lasso_results = compute_bench(alpha, list_n_samples, [n_features], precompute=True) pl.figure('scikit-learn LASSO benchmark results') pl.subplot(211) pl.plot(list_n_samples, lasso_results, 'b-', label='Lasso') pl.plot(list_n_samples, lars_lasso_results, 'r-', label='LassoLars') pl.title('precomputed Gram matrix, %d features, alpha=%s' % (n_features, alpha)) pl.legend(loc='upper left') pl.xlabel('number of samples') pl.ylabel('Time (s)') pl.axis('tight') n_samples = 2000 list_n_features = np.linspace(500, 3000, 5).astype(np.int) lasso_results, lars_lasso_results = compute_bench(alpha, [n_samples], list_n_features, precompute=False) pl.subplot(212) pl.plot(list_n_features, lasso_results, 'b-', label='Lasso') pl.plot(list_n_features, lars_lasso_results, 'r-', label='LassoLars') pl.title('%d samples, alpha=%s' % (n_samples, alpha)) pl.legend(loc='upper left') pl.xlabel('number of features') pl.ylabel('Time (s)') pl.axis('tight') pl.show() scikit-learn-0.14.1/benchmarks/bench_plot_fastkmeans.py000066400000000000000000000111041220075675300232060ustar00rootroot00000000000000from __future__ import print_function from collections import defaultdict from time import time import numpy as np from numpy import random as nr from sklearn.cluster.k_means_ import KMeans, MiniBatchKMeans def compute_bench(samples_range, features_range): it = 0 results = defaultdict(lambda: []) chunk = 100 max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('==============================') print('Iteration %03d of %03d' % (it, max_it)) print('==============================') print() data = nr.random_integers(-50, 50, (n_samples, n_features)) print('K-Means') tstart = time() kmeans = KMeans(init='k-means++', n_clusters=10).fit(data) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %0.5f" % kmeans.inertia_) print() results['kmeans_speed'].append(delta) results['kmeans_quality'].append(kmeans.inertia_) print('Fast K-Means') # let's prepare the data in small chunks mbkmeans = MiniBatchKMeans(init='k-means++', n_clusters=10, batch_size=chunk) tstart = time() mbkmeans.fit(data) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %f" % mbkmeans.inertia_) print() print() results['MiniBatchKMeans Speed'].append(delta) results['MiniBatchKMeans Quality'].append(mbkmeans.inertia_) return results def compute_bench_2(chunks): results = defaultdict(lambda: []) n_features = 50000 means = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1], [0.5, 0.5], [0.75, -0.5], [-1, 0.75], [1, 0]]) X = np.empty((0, 2)) for i in range(8): X = np.r_[X, means[i] + 0.8 * np.random.randn(n_features, 2)] max_it = len(chunks) it = 0 for chunk in chunks: it += 1 print('==============================') print('Iteration %03d of %03d' % (it, max_it)) print('==============================') print() print('Fast K-Means') tstart = time() mbkmeans = MiniBatchKMeans(init='k-means++', n_clusters=8, batch_size=chunk) mbkmeans.fit(X) delta = time() - tstart print("Speed: %0.3fs" % delta) print("Inertia: %0.3fs" % mbkmeans.inertia_) print() results['MiniBatchKMeans Speed'].append(delta) results['MiniBatchKMeans Quality'].append(mbkmeans.inertia_) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(50, 150, 5).astype(np.int) features_range = np.linspace(150, 50000, 5).astype(np.int) chunks = np.linspace(500, 10000, 15).astype(np.int) results = compute_bench(samples_range, features_range) results_2 = compute_bench_2(chunks) max_time = max([max(i) for i in [t for (label, t) in results.iteritems() if "speed" in label]]) max_inertia = max([max(i) for i in [ t for (label, t) in results.iteritems() if "speed" not in label]]) fig = plt.figure('scikit-learn K-Means benchmark results') for c, (label, timings) in zip('brcy', sorted(results.iteritems())): if 'speed' in label: ax = fig.add_subplot(2, 2, 1, projection='3d') ax.set_zlim3d(0.0, max_time * 1.1) else: ax = fig.add_subplot(2, 2, 2, projection='3d') ax.set_zlim3d(0.0, max_inertia * 1.1) X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) ax.plot_surface(X, Y, Z.T, cstride=1, rstride=1, color=c, alpha=0.5) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') i = 0 for c, (label, timings) in zip('br', sorted(results_2.iteritems())): i += 1 ax = fig.add_subplot(2, 2, i + 2) y = np.asarray(timings) ax.plot(chunks, y, color=c, alpha=0.8) ax.set_xlabel('Chunks') ax.set_ylabel(label) plt.show() scikit-learn-0.14.1/benchmarks/bench_plot_lasso_path.py000066400000000000000000000076431220075675300232240ustar00rootroot00000000000000"""Benchmarks of Lasso regularization path computation using Lars and CD The input data is mostly low rank but is a fat infinite tail. """ from __future__ import print_function from collections import defaultdict import gc import sys from time import time import numpy as np from sklearn.linear_model import lars_path from sklearn.linear_model import lasso_path from sklearn.datasets.samples_generator import make_regression def compute_bench(samples_range, features_range): it = 0 results = defaultdict(lambda: []) max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') dataset_kwargs = { 'n_samples': n_samples, 'n_features': n_features, 'n_informative': n_features / 10, 'effective_rank': min(n_samples, n_features) / 10, #'effective_rank': None, 'bias': 0.0, } print("n_samples: %d" % n_samples) print("n_features: %d" % n_features) X, y = make_regression(**dataset_kwargs) gc.collect() print("benchmarking lars_path (with Gram):", end='') sys.stdout.flush() tstart = time() G = np.dot(X.T, X) # precomputed Gram matrix Xy = np.dot(X.T, y) lars_path(X, y, Xy=Xy, Gram=G, method='lasso') delta = time() - tstart print("%0.3fs" % delta) results['lars_path (with Gram)'].append(delta) gc.collect() print("benchmarking lars_path (without Gram):", end='') sys.stdout.flush() tstart = time() lars_path(X, y, method='lasso') delta = time() - tstart print("%0.3fs" % delta) results['lars_path (without Gram)'].append(delta) gc.collect() print("benchmarking lasso_path (with Gram):", end='') sys.stdout.flush() tstart = time() lasso_path(X, y, precompute=True) delta = time() - tstart print("%0.3fs" % delta) results['lasso_path (with Gram)'].append(delta) gc.collect() print("benchmarking lasso_path (without Gram):", end='') sys.stdout.flush() tstart = time() lasso_path(X, y, precompute=False) delta = time() - tstart print("%0.3fs" % delta) results['lasso_path (without Gram)'].append(delta) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(10, 2000, 5).astype(np.int) features_range = np.linspace(10, 2000, 5).astype(np.int) results = compute_bench(samples_range, features_range) max_time = max(max(t) for t in results.values()) fig = plt.figure('scikit-learn Lasso path benchmark results') i = 1 for c, (label, timings) in zip('bcry', sorted(results.items())): ax = fig.add_subplot(2, 2, i, projection='3d') X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z.T, cstride=1, rstride=1, color=c, alpha=0.8) # dummy point plot to stick the legend to since surface plot do not # support legends (yet?) #ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') ax.set_zlabel('Time (s)') ax.set_zlim3d(0.0, max_time * 1.1) ax.set_title(label) #ax.legend() i += 1 plt.show() scikit-learn-0.14.1/benchmarks/bench_plot_neighbors.py000066400000000000000000000145431220075675300230440ustar00rootroot00000000000000""" Plot the scaling of the nearest neighbors algorithms with k, D, and N """ from time import time import numpy as np import pylab as pl from matplotlib import ticker from sklearn import neighbors, datasets def get_data(N, D, dataset='dense'): if dataset == 'dense': np.random.seed(0) return np.random.random((N, D)) elif dataset == 'digits': X = datasets.load_digits().data i = np.argsort(X[0])[::-1] X = X[:, i] return X[:N, :D] else: raise ValueError("invalid dataset: %s" % dataset) def barplot_neighbors(Nrange=2 ** np.arange(1, 11), Drange=2 ** np.arange(7), krange=2 ** np.arange(10), N=1000, D=64, k=5, leaf_size=30, dataset='digits'): algorithms = ('kd_tree', 'brute', 'ball_tree') fiducial_values = {'N': N, 'D': D, 'k': k} #------------------------------------------------------------ # varying N N_results_build = dict([(alg, np.zeros(len(Nrange))) for alg in algorithms]) N_results_query = dict([(alg, np.zeros(len(Nrange))) for alg in algorithms]) for i, NN in enumerate(Nrange): print("N = %i (%i out of %i)" % (NN, i + 1, len(Nrange))) X = get_data(NN, D, dataset) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=min(NN, k), algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() N_results_build[algorithm][i] = (t1 - t0) N_results_query[algorithm][i] = (t2 - t1) #------------------------------------------------------------ # varying D D_results_build = dict([(alg, np.zeros(len(Drange))) for alg in algorithms]) D_results_query = dict([(alg, np.zeros(len(Drange))) for alg in algorithms]) for i, DD in enumerate(Drange): print("D = %i (%i out of %i)" % (DD, i + 1, len(Drange))) X = get_data(N, DD, dataset) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=k, algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() D_results_build[algorithm][i] = (t1 - t0) D_results_query[algorithm][i] = (t2 - t1) #------------------------------------------------------------ # varying k k_results_build = dict([(alg, np.zeros(len(krange))) for alg in algorithms]) k_results_query = dict([(alg, np.zeros(len(krange))) for alg in algorithms]) X = get_data(N, DD, dataset) for i, kk in enumerate(krange): print("k = %i (%i out of %i)" % (kk, i + 1, len(krange))) for algorithm in algorithms: nbrs = neighbors.NearestNeighbors(n_neighbors=kk, algorithm=algorithm, leaf_size=leaf_size) t0 = time() nbrs.fit(X) t1 = time() nbrs.kneighbors(X) t2 = time() k_results_build[algorithm][i] = (t1 - t0) k_results_query[algorithm][i] = (t2 - t1) pl.figure('scikit-learn nearest neighbors benchmark results', figsize=(8, 11)) for (sbplt, vals, quantity, build_time, query_time) in [(311, Nrange, 'N', N_results_build, N_results_query), (312, Drange, 'D', D_results_build, D_results_query), (313, krange, 'k', k_results_build, k_results_query)]: ax = pl.subplot(sbplt, yscale='log') pl.grid(True) tick_vals = [] tick_labels = [] bottom = 10 ** np.min([min(np.floor(np.log10(build_time[alg]))) for alg in algorithms]) for i, alg in enumerate(algorithms): xvals = 0.1 + i * (1 + len(vals)) + np.arange(len(vals)) width = 0.8 c_bar = pl.bar(xvals, build_time[alg] - bottom, width, bottom, color='r') q_bar = pl.bar(xvals, query_time[alg], width, build_time[alg], color='b') tick_vals += list(xvals + 0.5 * width) tick_labels += ['%i' % val for val in vals] pl.text((i + 0.02) / len(algorithms), 0.98, alg, transform=ax.transAxes, ha='left', va='top', bbox=dict(facecolor='w', edgecolor='w', alpha=0.5)) pl.ylabel('Time (s)') ax.xaxis.set_major_locator(ticker.FixedLocator(tick_vals)) ax.xaxis.set_major_formatter(ticker.FixedFormatter(tick_labels)) for label in ax.get_xticklabels(): label.set_rotation(-90) label.set_fontsize(10) title_string = 'Varying %s' % quantity descr_string = '' for s in 'NDk': if s == quantity: pass else: descr_string += '%s = %i, ' % (s, fiducial_values[s]) descr_string = descr_string[:-2] pl.text(1.01, 0.5, title_string, transform=ax.transAxes, rotation=-90, ha='left', va='center', fontsize=20) pl.text(0.99, 0.5, descr_string, transform=ax.transAxes, rotation=-90, ha='right', va='center') pl.gcf().suptitle("%s data set" % dataset.capitalize(), fontsize=16) pl.figlegend((c_bar, q_bar), ('construction', 'N-point query'), 'upper right') if __name__ == '__main__': barplot_neighbors(dataset='digits') barplot_neighbors(dataset='dense') pl.show() scikit-learn-0.14.1/benchmarks/bench_plot_nmf.py000066400000000000000000000132671220075675300216460ustar00rootroot00000000000000""" Benchmarks of Non-Negative Matrix Factorization """ from __future__ import print_function import gc from time import time import numpy as np from collections import defaultdict from sklearn.decomposition.nmf import NMF, _initialize_nmf from sklearn.datasets.samples_generator import make_low_rank_matrix from sklearn.externals.six.moves import xrange def alt_nnmf(V, r, max_iter=1000, tol=1e-3, R=None): ''' A, S = nnmf(X, r, tol=1e-3, R=None) Implement Lee & Seung's algorithm Parameters ---------- V : 2-ndarray, [n_samples, n_features] input matrix r : integer number of latent features max_iter : integer, optional maximum number of iterations (default: 10000) tol : double tolerance threshold for early exit (when the update factor is within tol of 1., the function exits) R : integer, optional random seed Returns ------- A : 2-ndarray, [n_samples, r] Component part of the factorization S : 2-ndarray, [r, n_features] Data part of the factorization Reference --------- "Algorithms for Non-negative Matrix Factorization" by Daniel D Lee, Sebastian H Seung (available at http://citeseer.ist.psu.edu/lee01algorithms.html) ''' # Nomenclature in the function follows Lee & Seung eps = 1e-5 n, m = V.shape if R == "svd": W, H = _initialize_nmf(V, r) elif R is None: R = np.random.mtrand._rand W = np.abs(R.standard_normal((n, r))) H = np.abs(R.standard_normal((r, m))) for i in xrange(max_iter): updateH = np.dot(W.T, V) / (np.dot(np.dot(W.T, W), H) + eps) H *= updateH updateW = np.dot(V, H.T) / (np.dot(W, np.dot(H, H.T)) + eps) W *= updateW if True or (i % 10) == 0: max_update = max(updateW.max(), updateH.max()) if abs(1. - max_update) < tol: break return W, H def compute_bench(samples_range, features_range, rank=50, tolerance=1e-7): it = 0 timeset = defaultdict(lambda: []) err = defaultdict(lambda: []) max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') X = np.abs(make_low_rank_matrix(n_samples, n_features, effective_rank=rank, tail_strength=0.2)) gc.collect() print("benchmarking nndsvd-nmf: ") tstart = time() m = NMF(n_components=30, tol=tolerance, init='nndsvd').fit(X) tend = time() - tstart timeset['nndsvd-nmf'].append(tend) err['nndsvd-nmf'].append(m.reconstruction_err_) print(m.reconstruction_err_, tend) gc.collect() print("benchmarking nndsvda-nmf: ") tstart = time() m = NMF(n_components=30, init='nndsvda', tol=tolerance).fit(X) tend = time() - tstart timeset['nndsvda-nmf'].append(tend) err['nndsvda-nmf'].append(m.reconstruction_err_) print(m.reconstruction_err_, tend) gc.collect() print("benchmarking nndsvdar-nmf: ") tstart = time() m = NMF(n_components=30, init='nndsvdar', tol=tolerance).fit(X) tend = time() - tstart timeset['nndsvdar-nmf'].append(tend) err['nndsvdar-nmf'].append(m.reconstruction_err_) print(m.reconstruction_err_, tend) gc.collect() print("benchmarking random-nmf") tstart = time() m = NMF(n_components=30, init=None, max_iter=1000, tol=tolerance).fit(X) tend = time() - tstart timeset['random-nmf'].append(tend) err['random-nmf'].append(m.reconstruction_err_) print(m.reconstruction_err_, tend) gc.collect() print("benchmarking alt-random-nmf") tstart = time() W, H = alt_nnmf(X, r=30, R=None, tol=tolerance) tend = time() - tstart timeset['alt-random-nmf'].append(tend) err['alt-random-nmf'].append(np.linalg.norm(X - np.dot(W, H))) print(np.linalg.norm(X - np.dot(W, H)), tend) return timeset, err if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection axes3d import matplotlib.pyplot as plt samples_range = np.linspace(50, 500, 3).astype(np.int) features_range = np.linspace(50, 500, 3).astype(np.int) timeset, err = compute_bench(samples_range, features_range) for i, results in enumerate((timeset, err)): fig = plt.figure('scikit-learn Non-Negative Matrix Factorization benchmkar results') ax = fig.gca(projection='3d') for c, (label, timings) in zip('rbgcm', sorted(results.iteritems())): X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3, color=c) # dummy point plot to stick the legend to since surface plot do not # support legends (yet?) ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') zlabel = 'Time (s)' if i == 0 else 'reconstruction error' ax.set_zlabel(zlabel) ax.legend() plt.show() scikit-learn-0.14.1/benchmarks/bench_plot_omp_lars.py000066400000000000000000000105511220075675300226730ustar00rootroot00000000000000"""Benchmarks of orthogonal matching pursuit (:ref:`OMP`) versus least angle regression (:ref:`least_angle_regression`) The input data is mostly low rank but is a fat infinite tail. """ from __future__ import print_function import gc import sys from time import time import numpy as np from sklearn.linear_model import lars_path, orthogonal_mp from sklearn.datasets.samples_generator import make_sparse_coded_signal def compute_bench(samples_range, features_range): it = 0 results = dict() lars = np.empty((len(features_range), len(samples_range))) lars_gram = lars.copy() omp = lars.copy() omp_gram = lars.copy() max_it = len(samples_range) * len(features_range) for i_s, n_samples in enumerate(samples_range): for i_f, n_features in enumerate(features_range): it += 1 n_informative = n_features / 10 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') # dataset_kwargs = { # 'n_train_samples': n_samples, # 'n_test_samples': 2, # 'n_features': n_features, # 'n_informative': n_informative, # 'effective_rank': min(n_samples, n_features) / 10, # #'effective_rank': None, # 'bias': 0.0, # } dataset_kwargs = { 'n_samples': 1, 'n_components': n_features, 'n_features': n_samples, 'n_nonzero_coefs': n_informative, 'random_state': 0 } print("n_samples: %d" % n_samples) print("n_features: %d" % n_features) y, X, _ = make_sparse_coded_signal(**dataset_kwargs) X = np.asfortranarray(X) gc.collect() print("benchmarking lars_path (with Gram):", end='') sys.stdout.flush() tstart = time() G = np.dot(X.T, X) # precomputed Gram matrix Xy = np.dot(X.T, y) lars_path(X, y, Xy=Xy, Gram=G, max_iter=n_informative) delta = time() - tstart print("%0.3fs" % delta) lars_gram[i_f, i_s] = delta gc.collect() print("benchmarking lars_path (without Gram):", end='') sys.stdout.flush() tstart = time() lars_path(X, y, Gram=None, max_iter=n_informative) delta = time() - tstart print("%0.3fs" % delta) lars[i_f, i_s] = delta gc.collect() print("benchmarking orthogonal_mp (with Gram):", end='') sys.stdout.flush() tstart = time() orthogonal_mp(X, y, precompute_gram=True, n_nonzero_coefs=n_informative) delta = time() - tstart print("%0.3fs" % delta) omp_gram[i_f, i_s] = delta gc.collect() print("benchmarking orthogonal_mp (without Gram):", end='') sys.stdout.flush() tstart = time() orthogonal_mp(X, y, precompute_gram=False, n_nonzero_coefs=n_informative) delta = time() - tstart print("%0.3fs" % delta) omp[i_f, i_s] = delta results['time(LARS) / time(OMP)\n (w/ Gram)'] = (lars_gram / omp_gram) results['time(LARS) / time(OMP)\n (w/o Gram)'] = (lars / omp) return results if __name__ == '__main__': samples_range = np.linspace(1000, 5000, 5).astype(np.int) features_range = np.linspace(1000, 5000, 5).astype(np.int) results = compute_bench(samples_range, features_range) max_time = max(np.max(t) for t in results.values()) import pylab as pl fig = pl.figure('scikit-learn OMP vs. LARS benchmark results') for i, (label, timings) in enumerate(sorted(results.iteritems())): ax = fig.add_subplot(1, 2, i) vmax = max(1 - timings.min(), -1 + timings.max()) pl.matshow(timings, fignum=False, vmin=1 - vmax, vmax=1 + vmax) ax.set_xticklabels([''] + map(str, samples_range)) ax.set_yticklabels([''] + map(str, features_range)) pl.xlabel('n_samples') pl.ylabel('n_features') pl.title(label) pl.subplots_adjust(0.1, 0.08, 0.96, 0.98, 0.4, 0.63) ax = pl.axes([0.1, 0.08, 0.8, 0.06]) pl.colorbar(cax=ax, orientation='horizontal') pl.show() scikit-learn-0.14.1/benchmarks/bench_plot_parallel_pairwise.py000066400000000000000000000023371220075675300245610ustar00rootroot00000000000000# Author: Mathieu Blondel # License: BSD 3 clause import time import pylab as pl from sklearn.utils import check_random_state from sklearn.metrics.pairwise import pairwise_distances from sklearn.metrics.pairwise import pairwise_kernels def plot(func): random_state = check_random_state(0) one_core = [] multi_core = [] sample_sizes = range(1000, 6000, 1000) for n_samples in sample_sizes: X = random_state.rand(n_samples, 300) start = time.time() func(X, n_jobs=1) one_core.append(time.time() - start) start = time.time() func(X, n_jobs=-1) multi_core.append(time.time() - start) pl.figure('scikit-learn parallel %s benchmark results' % func.__name__) pl.plot(sample_sizes, one_core, label="one core") pl.plot(sample_sizes, multi_core, label="multi core") pl.xlabel('n_samples') pl.ylabel('Time (s)') pl.title('Parallel %s' % func.__name__) pl.legend() def euclidean_distances(X, n_jobs): return pairwise_distances(X, metric="euclidean", n_jobs=n_jobs) def rbf_kernels(X, n_jobs): return pairwise_kernels(X, metric="rbf", n_jobs=n_jobs, gamma=0.1) plot(euclidean_distances) plot(rbf_kernels) pl.show() scikit-learn-0.14.1/benchmarks/bench_plot_svd.py000066400000000000000000000055231220075675300216560ustar00rootroot00000000000000"""Benchmarks of Singular Value Decomposition (Exact and Approximate) The data is mostly low rank but is a fat infinite tail. """ import gc from time import time import numpy as np from collections import defaultdict from scipy.linalg import svd from sklearn.utils.extmath import randomized_svd from sklearn.datasets.samples_generator import make_low_rank_matrix def compute_bench(samples_range, features_range, n_iter=3, rank=50): it = 0 results = defaultdict(lambda: []) max_it = len(samples_range) * len(features_range) for n_samples in samples_range: for n_features in features_range: it += 1 print('====================') print('Iteration %03d of %03d' % (it, max_it)) print('====================') X = make_low_rank_matrix(n_samples, n_features, effective_rank=rank, tail_strength=0.2) gc.collect() print("benchmarking scipy svd: ") tstart = time() svd(X, full_matrices=False) results['scipy svd'].append(time() - tstart) gc.collect() print("benchmarking scikit-learn randomized_svd: n_iter=0") tstart = time() randomized_svd(X, rank, n_iter=0) results['scikit-learn randomized_svd (n_iter=0)'].append( time() - tstart) gc.collect() print("benchmarking scikit-learn randomized_svd: n_iter=%d " % n_iter) tstart = time() randomized_svd(X, rank, n_iter=n_iter) results['scikit-learn randomized_svd (n_iter=%d)' % n_iter].append(time() - tstart) return results if __name__ == '__main__': from mpl_toolkits.mplot3d import axes3d # register the 3d projection import matplotlib.pyplot as plt samples_range = np.linspace(2, 1000, 4).astype(np.int) features_range = np.linspace(2, 1000, 4).astype(np.int) results = compute_bench(samples_range, features_range) label = 'scikit-learn singular value decomposition benchmark results' fig = plt.figure(label) ax = fig.gca(projection='3d') for c, (label, timings) in zip('rbg', sorted(results.iteritems())): X, Y = np.meshgrid(samples_range, features_range) Z = np.asarray(timings).reshape(samples_range.shape[0], features_range.shape[0]) # plot the actual surface ax.plot_surface(X, Y, Z, rstride=8, cstride=8, alpha=0.3, color=c) # dummy point plot to stick the legend to since surface plot do not # support legends (yet?) ax.plot([1], [1], [1], color=c, label=label) ax.set_xlabel('n_samples') ax.set_ylabel('n_features') ax.set_zlabel('Time (s)') ax.legend() plt.show() scikit-learn-0.14.1/benchmarks/bench_plot_ward.py000066400000000000000000000022661220075675300220200ustar00rootroot00000000000000""" Benchmark scikit-learn's Ward implement compared to SciPy's """ import time import numpy as np from scipy.cluster import hierarchy import pylab as pl from sklearn.cluster import Ward ward = Ward(n_clusters=3) n_samples = np.logspace(.5, 3, 9) n_features = np.logspace(1, 3.5, 7) N_samples, N_features = np.meshgrid(n_samples, n_features) scikits_time = np.zeros(N_samples.shape) scipy_time = np.zeros(N_samples.shape) for i, n in enumerate(n_samples): for j, p in enumerate(n_features): X = np.random.normal(size=(n, p)) t0 = time.time() ward.fit(X) scikits_time[j, i] = time.time() - t0 t0 = time.time() hierarchy.ward(X) scipy_time[j, i] = time.time() - t0 ratio = scikits_time / scipy_time pl.figure("scikit-learn Ward's method benchmark results") pl.imshow(np.log(ratio), aspect='auto', origin="lower") pl.colorbar() pl.contour(ratio, levels=[1, ], colors='k') pl.yticks(range(len(n_features)), n_features.astype(np.int)) pl.ylabel('N features') pl.xticks(range(len(n_samples)), n_samples.astype(np.int)) pl.xlabel('N samples') pl.title("Scikit's time, in units of scipy time (log)") pl.show() scikit-learn-0.14.1/benchmarks/bench_random_projections.py000066400000000000000000000213021220075675300237140ustar00rootroot00000000000000""" =========================== Random projection benchmark =========================== Benchmarks for random projections. """ from __future__ import division from __future__ import print_function import gc import sys import optparse from datetime import datetime import collections import numpy as np import scipy.sparse as sp from sklearn import clone from sklearn.externals.six.moves import xrange from sklearn.random_projection import (SparseRandomProjection, GaussianRandomProjection, johnson_lindenstrauss_min_dim) def type_auto_or_float(val): if val == "auto": return "auto" else: return float(val) def type_auto_or_int(val): if val == "auto": return "auto" else: return int(val) def compute_time(t_start, delta): mu_second = 0.0 + 10 ** 6 # number of microseconds in a second return delta.seconds + delta.microseconds / mu_second def bench_scikit_transformer(X, transfomer): gc.collect() clf = clone(transfomer) # start time t_start = datetime.now() clf.fit(X) delta = (datetime.now() - t_start) # stop time time_to_fit = compute_time(t_start, delta) # start time t_start = datetime.now() clf.transform(X) delta = (datetime.now() - t_start) # stop time time_to_transform = compute_time(t_start, delta) return time_to_fit, time_to_transform # Make some random data with uniformly located non zero entries with # Gaussian distributed values def make_sparse_random_data(n_samples, n_features, n_nonzeros, random_state=None): rng = np.random.RandomState(random_state) data_coo = sp.coo_matrix( (rng.randn(n_nonzeros), (rng.randint(n_samples, size=n_nonzeros), rng.randint(n_features, size=n_nonzeros))), shape=(n_samples, n_features)) return data_coo.toarray(), data_coo.tocsr() def print_row(clf_type, time_fit, time_transform): print("%s | %s | %s" % (clf_type.ljust(30), ("%.4fs" % time_fit).center(12), ("%.4fs" % time_transform).center(12))) if __name__ == "__main__": ########################################################################### # Option parser ########################################################################### op = optparse.OptionParser() op.add_option("--n-times", dest="n_times", default=5, type=int, help="Benchmark results are average over n_times experiments") op.add_option("--n-features", dest="n_features", default=10 ** 4, type=int, help="Number of features in the benchmarks") op.add_option("--n-components", dest="n_components", default="auto", help="Size of the random subspace." "('auto' or int > 0)") op.add_option("--ratio-nonzeros", dest="ratio_nonzeros", default=10 ** -3, type=float, help="Number of features in the benchmarks") op.add_option("--n-samples", dest="n_samples", default=500, type=int, help="Number of samples in the benchmarks") op.add_option("--random-seed", dest="random_seed", default=13, type=int, help="Seed used by the random number generators.") op.add_option("--density", dest="density", default=1 / 3, help="Density used by the sparse random projection." "('auto' or float (0.0, 1.0]") op.add_option("--eps", dest="eps", default=0.5, type=float, help="See the documentation of the underlying transformers.") op.add_option("--transformers", dest="selected_transformers", default='GaussianRandomProjection,SparseRandomProjection', type=str, help="Comma-separated list of transformer to benchmark. " "Default: %default. Available: " "GaussianRandomProjection,SparseRandomProjection") op.add_option("--dense", dest="dense", default=False, action="store_true", help="Set input space as a dense matrix.") (opts, args) = op.parse_args() if len(args) > 0: op.error("this script takes no arguments.") sys.exit(1) opts.n_components = type_auto_or_int(opts.n_components) opts.density = type_auto_or_float(opts.density) selected_transformers = opts.selected_transformers.split(',') ########################################################################### # Generate dataset ########################################################################### n_nonzeros = int(opts.ratio_nonzeros * opts.n_features) print('Dataset statics') print("===========================") print('n_samples \t= %s' % opts.n_samples) print('n_features \t= %s' % opts.n_features) if opts.n_components == "auto": print('n_components \t= %s (auto)' % johnson_lindenstrauss_min_dim(n_samples=opts.n_samples, eps=opts.eps)) else: print('n_components \t= %s' % opts.n_components) print('n_elements \t= %s' % (opts.n_features * opts.n_samples)) print('n_nonzeros \t= %s per feature' % n_nonzeros) print('ratio_nonzeros \t= %s' % opts.ratio_nonzeros) print('') ########################################################################### # Set transformer input ########################################################################### transformers = {} ########################################################################### # Set GaussianRandomProjection input gaussian_matrix_params = { "n_components": opts.n_components, "random_state": opts.random_seed } transformers["GaussianRandomProjection"] = \ GaussianRandomProjection(**gaussian_matrix_params) ########################################################################### # Set SparseRandomProjection input sparse_matrix_params = { "n_components": opts.n_components, "random_state": opts.random_seed, "density": opts.density, "eps": opts.eps, } transformers["SparseRandomProjection"] = \ SparseRandomProjection(**sparse_matrix_params) ########################################################################### # Perform benchmark ########################################################################### time_fit = collections.defaultdict(list) time_transform = collections.defaultdict(list) print('Benchmarks') print("===========================") print("Generate dataset benchmarks... ", end="") X_dense, X_sparse = make_sparse_random_data(opts.n_samples, opts.n_features, n_nonzeros, random_state=opts.random_seed) X = X_dense if opts.dense else X_sparse print("done") for name in selected_transformers: print("Perform benchmarks for %s..." % name) for iteration in xrange(opts.n_times): print("\titer %s..." % iteration, end="") time_to_fit, time_to_transform = bench_scikit_transformer(X_dense, transformers[name]) time_fit[name].append(time_to_fit) time_transform[name].append(time_to_transform) print("done") print("") ########################################################################### # Print results ########################################################################### print("Script arguments") print("===========================") arguments = vars(opts) print("%s \t | %s " % ("Arguments".ljust(16), "Value".center(12),)) print(25 * "-" + ("|" + "-" * 14) * 1) for key, value in arguments.items(): print("%s \t | %s " % (str(key).ljust(16), str(value).strip().center(12))) print("") print("Transformer performance:") print("===========================") print("Results are averaged over %s repetition(s)." % opts.n_times) print("") print("%s | %s | %s" % ("Transformer".ljust(30), "fit".center(12), "transform".center(12))) print(31 * "-" + ("|" + "-" * 14) * 2) for name in sorted(selected_transformers): print_row(name, np.mean(time_fit[name]), np.mean(time_transform[name])) print("") print("") scikit-learn-0.14.1/benchmarks/bench_sample_without_replacement.py000066400000000000000000000175101220075675300254460ustar00rootroot00000000000000""" Benchmarks for sampling without replacement of integer. """ from __future__ import division from __future__ import print_function import gc import sys import optparse from datetime import datetime import operator import matplotlib.pyplot as plt import numpy as np import random from sklearn.externals.six.moves import xrange from sklearn.utils.random import sample_without_replacement def compute_time(t_start, delta): mu_second = 0.0 + 10 ** 6 # number of microseconds in a second return delta.seconds + delta.microseconds / mu_second def bench_sample(sampling, n_population, n_samples): gc.collect() # start time t_start = datetime.now() sampling(n_population, n_samples) delta = (datetime.now() - t_start) # stop time time = compute_time(t_start, delta) return time if __name__ == "__main__": ########################################################################### # Option parser ########################################################################### op = optparse.OptionParser() op.add_option("--n-times", dest="n_times", default=5, type=int, help="Benchmark results are average over n_times experiments") op.add_option("--n-population", dest="n_population", default=100000, type=int, help="Size of the population to sample from.") op.add_option("--n-step", dest="n_steps", default=5, type=int, help="Number of step interval between 0 and n_population.") default_algorithms = "custom-tracking-selection,custom-auto," \ "custom-reservoir-sampling,custom-pool,"\ "python-core-sample,numpy-permutation" op.add_option("--algorithm", dest="selected_algorithm", default=default_algorithms, type=str, help="Comma-separated list of transformer to benchmark. " "Default: %default. \nAvailable: %default") # op.add_option("--random-seed", # dest="random_seed", default=13, type=int, # help="Seed used by the random number generators.") (opts, args) = op.parse_args() if len(args) > 0: op.error("this script takes no arguments.") sys.exit(1) selected_algorithm = opts.selected_algorithm.split(',') for key in selected_algorithm: if key not in default_algorithms.split(','): raise ValueError("Unknown sampling algorithm \"%s\" not in (%s)." % (key, default_algorithms)) ########################################################################### # List sampling algorithm ########################################################################### # We assume that sampling algorithm has the following signature: # sample(n_population, n_sample) # sampling_algorithm = {} ########################################################################### # Set Python core input sampling_algorithm["python-core-sample"] = \ lambda n_population, n_sample: \ random.sample(xrange(n_population), n_sample) ########################################################################### # Set custom automatic method selection sampling_algorithm["custom-auto"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="auto", random_state=random_state) ########################################################################### # Set custom tracking based method sampling_algorithm["custom-tracking-selection"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="tracking_selection", random_state=random_state) ########################################################################### # Set custom reservoir based method sampling_algorithm["custom-reservoir-sampling"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="reservoir_sampling", random_state=random_state) ########################################################################### # Set custom reservoir based method sampling_algorithm["custom-pool"] = \ lambda n_population, n_samples, random_state=None: \ sample_without_replacement(n_population, n_samples, method="pool", random_state=random_state) ########################################################################### # Numpy permutation based sampling_algorithm["numpy-permutation"] = \ lambda n_population, n_sample: \ np.random.permutation(n_population)[:n_sample] ########################################################################### # Remove unspecified algorithm sampling_algorithm = dict((key, value) for key, value in sampling_algorithm.items() if key in selected_algorithm) ########################################################################### # Perform benchmark ########################################################################### time = {} n_samples = np.linspace(start=0, stop=opts.n_population, num=opts.n_steps).astype(np.int) ratio = n_samples / opts.n_population print('Benchmarks') print("===========================") for name in sorted(sampling_algorithm): print("Perform benchmarks for %s..." % name, end="") time[name] = np.zeros(shape=(opts.n_steps, opts.n_times)) for step in xrange(opts.n_steps): for it in xrange(opts.n_times): time[name][step, it] = bench_sample(sampling_algorithm[name], opts.n_population, n_samples[step]) print("done") print("Averaging results...", end="") for name in sampling_algorithm: time[name] = np.mean(time[name], axis=1) print("done\n") # Print results ########################################################################### print("Script arguments") print("===========================") arguments = vars(opts) print("%s \t | %s " % ("Arguments".ljust(16), "Value".center(12),)) print(25 * "-" + ("|" + "-" * 14) * 1) for key, value in arguments.items(): print("%s \t | %s " % (str(key).ljust(16), str(value).strip().center(12))) print("") print("Sampling algorithm performance:") print("===============================") print("Results are averaged over %s repetition(s)." % opts.n_times) print("") fig = plt.figure('scikit-learn sample w/o replacement benchmark results') plt.title("n_population = %s, n_times = %s" % (opts.n_population, opts.n_times)) ax = fig.add_subplot(111) for name in sampling_algorithm: ax.plot(ratio, time[name], label=name) ax.set_xlabel('ratio of n_sample / n_population') ax.set_ylabel('Time (s)') ax.legend() # Sort legend labels handles, labels = ax.get_legend_handles_labels() hl = sorted(zip(handles, labels), key=operator.itemgetter(1)) handles2, labels2 = zip(*hl) ax.legend(handles2, labels2, loc=0) plt.show() scikit-learn-0.14.1/benchmarks/bench_sgd_regression.py000066400000000000000000000107551220075675300230440ustar00rootroot00000000000000""" Benchmark for SGD regression Compares SGD regression against coordinate descent and Ridge on synthetic data. """ print(__doc__) # Author: Peter Prettenhofer # License: BSD 3 clause import numpy as np import pylab as pl import gc from time import time from sklearn.linear_model import Ridge, SGDRegressor, ElasticNet from sklearn.metrics import mean_squared_error from sklearn.datasets.samples_generator import make_regression if __name__ == "__main__": list_n_samples = np.linspace(100, 10000, 5).astype(np.int) list_n_features = [10, 100, 1000] n_test = 1000 noise = 0.1 alpha = 0.01 sgd_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) elnet_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) ridge_results = np.zeros((len(list_n_samples), len(list_n_features), 2)) for i, n_train in enumerate(list_n_samples): for j, n_features in enumerate(list_n_features): X, y, coef = make_regression( n_samples=n_train + n_test, n_features=n_features, noise=noise, coef=True) X_train = X[:n_train] y_train = y[:n_train] X_test = X[n_train:] y_test = y[n_train:] print("=======================") print("Round %d %d" % (i, j)) print("n_features:", n_features) print("n_samples:", n_train) # Shuffle data idx = np.arange(n_train) np.random.seed(13) np.random.shuffle(idx) X_train = X_train[idx] y_train = y_train[idx] std = X_train.std(axis=0) mean = X_train.mean(axis=0) X_train = (X_train - mean) / std X_test = (X_test - mean) / std std = y_train.std(axis=0) mean = y_train.mean(axis=0) y_train = (y_train - mean) / std y_test = (y_test - mean) / std gc.collect() print("- benchmarking ElasticNet") clf = ElasticNet(alpha=alpha, rho=0.5, fit_intercept=False) tstart = time() clf.fit(X_train, y_train) elnet_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) elnet_results[i, j, 1] = time() - tstart gc.collect() print("- benchmarking SGD") n_iter = np.ceil(10 ** 4.0 / n_train) clf = SGDRegressor(alpha=alpha, fit_intercept=False, n_iter=n_iter, learning_rate="invscaling", eta0=.01, power_t=0.25) tstart = time() clf.fit(X_train, y_train) sgd_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) sgd_results[i, j, 1] = time() - tstart gc.collect() print("- benchmarking RidgeRegression") clf = Ridge(alpha=alpha, fit_intercept=False) tstart = time() clf.fit(X_train, y_train) ridge_results[i, j, 0] = mean_squared_error(clf.predict(X_test), y_test) ridge_results[i, j, 1] = time() - tstart # Plot results i = 0 m = len(list_n_features) pl.figure('scikit-learn SGD regression benchmark results', figsize=(5 * 2, 4 * m)) for j in range(m): pl.subplot(m, 2, i + 1) pl.plot(list_n_samples, np.sqrt(elnet_results[:, j, 0]), label="ElasticNet") pl.plot(list_n_samples, np.sqrt(sgd_results[:, j, 0]), label="SGDRegressor") pl.plot(list_n_samples, np.sqrt(ridge_results[:, j, 0]), label="Ridge") pl.legend(prop={"size": 10}) pl.xlabel("n_train") pl.ylabel("RMSE") pl.title("Test error - %d features" % list_n_features[j]) i += 1 pl.subplot(m, 2, i + 1) pl.plot(list_n_samples, np.sqrt(elnet_results[:, j, 1]), label="ElasticNet") pl.plot(list_n_samples, np.sqrt(sgd_results[:, j, 1]), label="SGDRegressor") pl.plot(list_n_samples, np.sqrt(ridge_results[:, j, 1]), label="Ridge") pl.legend(prop={"size": 10}) pl.xlabel("n_train") pl.ylabel("Time [sec]") pl.title("Training time - %d features" % list_n_features[j]) i += 1 pl.subplots_adjust(hspace=.30) pl.show() scikit-learn-0.14.1/benchmarks/bench_tree.py000066400000000000000000000070411220075675300207600ustar00rootroot00000000000000""" To run this, you'll need to have installed. * scikit-learn Does two benchmarks First, we fix a training set, increase the number of samples to classify and plot number of classified samples as a function of time. In the second benchmark, we increase the number of dimensions of the training set, classify a sample and plot the time taken as a function of the number of dimensions. """ import numpy as np import pylab as pl import gc from datetime import datetime # to store the results scikit_classifier_results = [] scikit_regressor_results = [] mu_second = 0.0 + 10 ** 6 # number of microseconds in a second def bench_scikit_tree_classifier(X, Y): """Benchmark with scikit-learn decision tree classifier""" from sklearn.tree import DecisionTreeClassifier gc.collect() # start time tstart = datetime.now() clf = DecisionTreeClassifier() clf.fit(X, Y).predict(X) delta = (datetime.now() - tstart) # stop time scikit_classifier_results.append( delta.seconds + delta.microseconds / mu_second) def bench_scikit_tree_regressor(X, Y): """Benchmark with scikit-learn decision tree regressor""" from sklearn.tree import DecisionTreeRegressor gc.collect() # start time tstart = datetime.now() clf = DecisionTreeRegressor() clf.fit(X, Y).predict(X) delta = (datetime.now() - tstart) # stop time scikit_regressor_results.append( delta.seconds + delta.microseconds / mu_second) if __name__ == '__main__': print('============================================') print('Warning: this is going to take a looong time') print('============================================') n = 10 step = 10000 n_samples = 10000 dim = 10 n_classes = 10 for i in range(n): print('============================================') print('Entering iteration %s of %s' % (i, n)) print('============================================') n_samples += step X = np.random.randn(n_samples, dim) Y = np.random.randint(0, n_classes, (n_samples,)) bench_scikit_tree_classifier(X, Y) Y = np.random.randn(n_samples) bench_scikit_tree_regressor(X, Y) xx = range(0, n * step, step) pl.figure('scikit-learn tree benchmark results') pl.subplot(211) pl.title('Learning with varying number of samples') pl.plot(xx, scikit_classifier_results, 'g-', label='classification') pl.plot(xx, scikit_regressor_results, 'r-', label='regression') pl.legend(loc='upper left') pl.xlabel('number of samples') pl.ylabel('Time (s)') scikit_classifier_results = [] scikit_regressor_results = [] n = 10 step = 500 start_dim = 500 n_classes = 10 dim = start_dim for i in range(0, n): print('============================================') print('Entering iteration %s of %s' % (i, n)) print('============================================') dim += step X = np.random.randn(100, dim) Y = np.random.randint(0, n_classes, (100,)) bench_scikit_tree_classifier(X, Y) Y = np.random.randn(100) bench_scikit_tree_regressor(X, Y) xx = np.arange(start_dim, start_dim + n * step, step) pl.subplot(212) pl.title('Learning in high dimensional spaces') pl.plot(xx, scikit_classifier_results, 'g-', label='classification') pl.plot(xx, scikit_regressor_results, 'r-', label='regression') pl.legend(loc='upper left') pl.xlabel('number of dimensions') pl.ylabel('Time (s)') pl.axis('tight') pl.show() scikit-learn-0.14.1/doc/000077500000000000000000000000001220075675300147365ustar00rootroot00000000000000scikit-learn-0.14.1/doc/Makefile000066400000000000000000000074131220075675300164030ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml pickle json htmlhelp qthelp latex latexpdf changes linkcheck doctest all: html-noplot help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " changes to make an overview of all changed/added/deprecated items" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: -rm -rf $(BUILDDIR)/* -rm -rf auto_examples/ -rm -rf generated/* -rm -rf modules/generated/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html/stable @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable" html-noplot: $(SPHINXBUILD) -D plot_gallery=False -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html/stable @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html/stable." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/scikit-learn.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/scikit-learn.qhc" latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." make -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." download-data: python -c "from sklearn.datasets.lfw import check_fetch_lfw; check_fetch_lfw()" scikit-learn-0.14.1/doc/README000066400000000000000000000033041220075675300156160ustar00rootroot00000000000000Documentation for scikit-learn ------------------------------- This section contains the full manual and web page as displayed in http://scikit-learn.sf.net. To generate the full web page, including the example gallery (this might take a while): make html Or, if you'd rather not build the example gallery: make html-noplot That should create all the doc in directory _build/html To build the PDF manual, run make latexpdf Upload the generated doc to sourceforge --------------------------------------- First of, generate the html documentation:: make html This should create a directory _build/html/stable with the documentation in html format. Now can upload the generated HTML documentation using scp or some other SFTP clients. * Project web Hostname: web.sourceforge.net * Path: htdocs/ * Username: Combine your SourceForge.net Username with your SourceForge.net project UNIX name using a comma ( "," ); see below * Password: Your SourceForge.net Password An example session might look like the following for Username "jsmith" uploading a file for this project, using rsync with the right switch for the permissions: [jsmith@linux ~]$ rsync -rltvz --delete _build/html/stable/ \ > jsmith,scikit-learn@web.sourceforge.net:htdocs/dev -essh Connecting to web.sourceforge.net... The authenticity of host 'web.sourceforge.net (216.34.181.57)' can't be established. RSA key fingerprint is 68:b3:26:02:a0:07:e4:78:d4:ec:7f:2f:6a:4d:32:c5. Are you sure you want to continue connecting (yes/no)? yes Warning: Permanently added 'web.sourceforge.net,216.34.181.57' (RSA) to the list of known hosts. jsmith,fooproject@web.sourceforge.net's password: sending incremental file list ... scikit-learn-0.14.1/doc/about.rst000066400000000000000000000122021220075675300165770ustar00rootroot00000000000000 About us ======== .. include:: ../AUTHORS.rst .. _citing-scikit-learn: Citing scikit-learn ------------------- If you use scikit-learn in scientific publication, we would appreciate citations to the following paper: `Scikit-learn: Machine Learning in Python `_, Pedregosa *et al.*, JMLR 12, pp. 2825-2830, 2011. Bibtex entry:: @article{scikit-learn, title={Scikit-learn: Machine Learning in {P}ython}, author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.}, journal={Journal of Machine Learning Research}, volume={12}, pages={2825--2830}, year={2011} } Funding ------- `INRIA `_ actively supports this project. It has provided funding for Fabian Pedregosa to work on this project full time in the period 2010-2012. It also hosts coding sprints and other events. .. image:: images/inria-logo.jpg `Google `_ sponsored David Cournapeau with a Summer of Code Scholarship in the summer of 2007, `Vlad Niculae`_ in 2011, and `Vlad Niculae`_ and Immanuel Bayer in 2012 . It also provided funding for sprints and events around scikit-learn. If you would like to participate in the next Google Summer of code program, please see `this page `_ The `NeuroDebian `_ project providing `Debian `_ packaging and contributions is supported by `Dr. James V. Haxby `_ (`Dartmouth College `_). The `PSF `_ helped find and manage funding for our 2011 Granada sprint. More information can be found `here `_ `tinyclues `_ funded the 2011 international Granada sprint. Donating to the project ~~~~~~~~~~~~~~~~~~~~~~~ If you are interested in donating to the project or to one of our code-sprints, you can use the *Paypal* button below or the `NumFOCUS Donations Page `_ (if you use the latter, please indicate that you are donating for the scikit-learn project). All donations will be handled by `NumFOCUS `_, a non-profit-organization which is managed by a board of `Scipy community members `_. NumFOCUS's mission is to foster scientific computing software, in particular in Python. As a fiscal home of scikit-learn, it ensures that money is available when needed to keep the project funded and available while in compliance with tax regulations. The received donations for the scikit-learn project mostly will go towards covering travel-expenses for code sprints, as well as towards the organization budget of the project [#f1]_. .. raw :: html


.. rubric:: Notes .. [#f1] Regarding the organization budget in particular, we might use some of the donated funds to pay for other project expenses such as DNS, hosting or continuous integration services. The 2013' Paris international sprint ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |center-div| |telecom| |tinyclues| |afpy| |FNRS| |end-div| .. |center-div| raw:: html
.. |telecom| image:: http://f.hypotheses.org/wp-content/blogs.dir/331/files/2011/03/Logo-TPT.jpg :width: 150px :target: http://www.telecom-paristech.fr/ .. |tinyclues| image:: http://www.tinyclues.com/item/50b77d01e4b0bff132989dfd?format=original :width: 150px :target: http://www.tinyclues.fr .. |afpy| image:: http://www.afpy.org/logo.png :width: 150px :target: http://www.afpy.org .. |SGR| image:: http://www.svi.cnrs-bellevue.fr/wikimedia/images/Logo_svi_inp.png :width: 150px :target: http://www.svi.cnrs-bellevue.fr .. |FNRS| image:: http://www.fnrs.be/uploaddocs/images/COMMUNIQUER/FRS-FNRS_rose_transp.png :width: 150px :target: http://www.frs-fnrs.be/ .. figure:: http://sites.uclouvain.be/dysco/pmwiki/uploads/Main/dysco.gif :width: 150px :target: http://sites.uclouvain.be/dysco/ IAP VII/19 - DYSCO .. |end-div| raw:: html
*For more information on this sprint, see* `here `_ scikit-learn-0.14.1/doc/conf.py000066400000000000000000000163011220075675300162360ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # scikit-learn documentation build configuration file, created by # sphinx-quickstart on Fri Jan 8 09:13:42 2010. # # This file is execfile()d with the current directory set to its containing # dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import sys import os from sklearn.externals.six import u # If extensions (or modules to document with autodoc) are in another # directory, add these directories to sys.path here. If the directory # is relative to the documentation root, use os.path.abspath to make it # absolute, like shown here. sys.path.insert(0, os.path.abspath('sphinxext')) # -- General configuration --------------------------------------------------- # Try to override the matplotlib configuration as early as possible try: import gen_rst except: pass # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = ['gen_rst', 'sphinx.ext.autodoc', 'sphinx.ext.autosummary', 'sphinx.ext.pngmath', 'numpy_ext.numpydoc' ] autosummary_generate = True autodoc_default_flags = ['members', 'inherited-members'] # Add any paths that contain templates here, relative to this directory. templates_path = ['templates'] # generate autosummary even if no references autosummary_generate = True # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8' # Generate the plots for the gallery plot_gallery = True # The master toctree document. master_doc = 'index' # General information about the project. project = u('scikit-learn') copyright = u('2010 - 2013, scikit-learn developers (BSD License)') # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '0.14' # The full version, including alpha/beta/rc tags. import sklearn release = sklearn.__version__ # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of documents that shouldn't be included in the build. #unused_docs = [] # List of directories, relative to source directory, that shouldn't be # searched for source files. exclude_trees = ['_build', 'templates', 'includes'] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. add_function_parentheses = False # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # -- Options for HTML output ------------------------------------------------- # The theme to use for HTML and HTML Help pages. Major themes that come with # Sphinx are currently 'default' and 'sphinxdoc'. html_theme = 'scikit-learn' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. html_theme_options = {'oldversion': False, 'collapsiblesidebar': True, 'google_analytics': True, 'surveybanner': False, 'sprintbanner' : True} # Add any paths that contain custom themes here, relative to this directory. html_theme_path = ['themes'] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. html_short_title = 'scikit-learn' # The name of an image file (relative to this directory) to place at the top # of the sidebar. html_logo = 'logos/scikit-learn-logo-small.png' # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. html_favicon = 'logos/favicon.ico' # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['images'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. html_use_modindex = False # If false, no index is generated. html_use_index = False # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # If nonempty, this is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = '' # Output file base name for HTML help builder. htmlhelp_basename = 'scikit-learndoc' # -- Options for LaTeX output ------------------------------------------------ # The paper size ('letter' or 'a4'). #latex_paper_size = 'letter' # The font size ('10pt', '11pt' or '12pt'). #latex_font_size = '10pt' # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, documentclass # [howto/manual]). latex_documents = [('index', 'user_guide.tex', u('scikit-learn user guide'), u('scikit-learn developers'), 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. latex_logo = "logos/scikit-learn-logo.png" # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # Additional stuff for the LaTeX preamble. latex_preamble = r""" \usepackage{amsmath}\usepackage{amsfonts}\usepackage{bm}\usepackage{morefloats} \usepackage{enumitem} \setlistdepth{10} """ # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_use_modindex = True trim_doctests_flags = True scikit-learn-0.14.1/doc/data_transforms.rst000066400000000000000000000004101220075675300206520ustar00rootroot00000000000000.. include:: includes/big_toc_css.rst .. _data-transforms: Dataset transformations ----------------------- .. toctree:: modules/feature_extraction modules/preprocessing modules/kernel_approximation modules/random_projection modules/metrics scikit-learn-0.14.1/doc/datasets/000077500000000000000000000000001220075675300165465ustar00rootroot00000000000000scikit-learn-0.14.1/doc/datasets/covtype.rst000066400000000000000000000014071220075675300207730ustar00rootroot00000000000000 .. _covtype: Forest covertypes ================= The samples in this dataset correspond to 30×30m patches of forest in the US, collected for the task of predicting each patch's cover type, i.e. the dominant species of tree. There are seven covertypes, making this a multiclass classification problem. Each sample has 54 features, described on the `dataset's homepage `_. Some of the features are boolean indicators, while others are discrete or continuous measurements. :func:`sklearn.datasets.fetch_covtype` will load the covertype dataset; it returns a dictionary-like object with the feature matrix in the ``data`` member and the target values in ``target``. The dataset will be downloaded from the web if necessary. scikit-learn-0.14.1/doc/datasets/index.rst000066400000000000000000000131331220075675300204100ustar00rootroot00000000000000.. _datasets: ========================= Dataset loading utilities ========================= .. currentmodule:: sklearn.datasets The ``sklearn.datasets`` package embeds some small toy datasets as introduced in the :ref:`Getting Started ` section. To evaluate the impact of the scale of the dataset (``n_samples`` and ``n_features``) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data. This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithm on data that comes from the 'real world'. General dataset API =================== There are three distinct kinds of dataset interfaces for different types of datasets. The simplest one is the interface for sample images, which is described below in the :ref:`sample_images` section. The dataset generation functions and the svmlight loader share a simplistic interface, returning a tuple ``(X, y)`` consisting of a n_samples x n_features numpy array X and an array of length n_samples containing the targets y. The toy datasets as well as the 'real world' datasets and the datasets fetched from mldata.org have more sophisticated structure. These functions return a dictionary-like object holding at least two items: an array of shape ``n_samples`` * `` n_features`` with key ``data`` (except for 20newsgroups) and a NumPy array of length ``n_features``, containing the target values, with key ``target``. The datasets also contain a description in ``DESCR`` and some contain ``feature_names`` and ``target_names``. See the dataset descriptions below for details. Toy datasets ============ scikit-learn comes with a few small standard datasets that do not require to download any file from some external website. .. autosummary:: :toctree: ../modules/generated/ :template: function.rst load_boston load_iris load_diabetes load_digits load_linnerud These datasets are useful to quickly illustrate the behavior of the various algorithms implemented in the scikit. They are however often too small to be representative of real world machine learning tasks. .. _sample_images: Sample images ============= The scikit also embed a couple of sample JPEG images published under Creative Commons license by their authors. Those image can be useful to test algorithms and pipeline on 2D data. .. autosummary:: :toctree: ../modules/generated/ :template: function.rst load_sample_images load_sample_image .. image:: ../auto_examples/cluster/images/plot_color_quantization_1.png :target: ../auto_examples/cluster/plot_color_quantization.html :scale: 30 :align: right .. warning:: The default coding of images is based on the ``uint8`` dtype to spare memory. Often machine learning algorithms work best if the input is converted to a floating point representation first. Also, if you plan to use ``pylab.imshow`` don't forget to scale to the range 0 - 1 as done in the following example. .. topic:: Examples: * :ref:`example_cluster_plot_color_quantization.py` .. _sample_generators: Sample generators ================= In addition, scikit-learn includes various random sample generators that can be used to build artificial datasets of controlled size and complexity. .. image:: ../auto_examples/datasets/images/plot_random_dataset_1.png :target: ../auto_examples/datasets/plot_random_dataset.html :scale: 50 :align: center .. autosummary:: :toctree: ../modules/generated/ :template: function.rst make_classification make_multilabel_classification make_regression make_blobs make_friedman1 make_friedman2 make_friedman3 make_hastie_10_2 make_low_rank_matrix make_sparse_coded_signal make_sparse_uncorrelated make_spd_matrix make_swiss_roll make_s_curve make_sparse_spd_matrix make_biclusters make_checkerboard .. _libsvm_loader: Datasets in svmlight / libsvm format ==================================== scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each line takes the form ``