pax_global_header00006660000000000000000000000064125155732370014524gustar00rootroot0000000000000052 comment=b8da72c50d20b6f8c0df2c2f39620715b08ddd32 simplebayes-1.5.7/000077500000000000000000000000001251557323700140535ustar00rootroot00000000000000simplebayes-1.5.7/.gitignore000066400000000000000000000012741251557323700160470ustar00rootroot00000000000000# Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ lib/ lib64/ parts/ sdist/ var/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .cache nosetests.xml coverage.xml # Translations *.mo *.pot # Django stuff: *.log # Sphinx documentation docs/_build/ # PyBuilder target/ .idea .vagrant MANIFESTsimplebayes-1.5.7/.travis.yml000066400000000000000000000004541251557323700161670ustar00rootroot00000000000000language: python cache: - apt - pip sudo: false python: - "2.7" - "3.4" install: - pip install -r setup/requirements.dev.txt script: - nosetests tests/test.py --with-coverage --cover-package=simplebayes --cover-min-percentage 100 - flake8 simplebayes tests - pylint simplebayes tests simplebayes-1.5.7/LICENSE000066400000000000000000000020671251557323700150650ustar00rootroot00000000000000The MIT License (MIT) Copyright (c) 2015 Ryan Vennell Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject 
to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. simplebayes-1.5.7/MANIFEST.in000066400000000000000000000000421251557323700156050ustar00rootroot00000000000000include README.rst include LICENSEsimplebayes-1.5.7/README.rst000066400000000000000000000066711251557323700155540ustar00rootroot00000000000000simplebayes =========== A memory-based, optional-persistence naïve bayesian text classifier. -------------------------------------------------------------------- :: This work is heavily inspired by the python "redisbayes" module found here: [https://github.com/jart/redisbayes] and [https://pypi.python.org/pypi/redisbayes] I've elected to write this to alleviate the network/time requirements when using the bayesian classifier to classify large sets of text, or when attempting to train with very large sets of sample data. Build Status ------------ .. image:: https://travis-ci.org/hickeroar/simplebayes.svg?branch=master .. image:: https://img.shields.io/badge/coverage-100%-brightgreen.svg?style=flat .. image:: https://img.shields.io/badge/pylint-10.00/10-brightgreen.svg?style=flat .. image:: https://img.shields.io/badge/flake8-passing-brightgreen.svg?style=flat Installation ------------ :: sudo pip install simplebayes Basic Usage ----------- .. 
code-block:: python

    import simplebayes
    bayes = simplebayes.SimpleBayes()

    bayes.train('good', 'sunshine drugs love sex lobster sloth')
    bayes.train('bad', 'fear death horror government zombie')

    assert bayes.classify('sloths are so cute i love them') == 'good'
    assert bayes.classify('i would fear a zombie and love the government') == 'bad'

    # print() with parentheses works on both Python 2.7 and 3.4
    print(bayes.score('i fear zombies and love the government'))

    bayes.untrain('bad', 'fear death')

    assert bayes.tally('bad') == 3

Cache Usage
-----------

.. code-block:: python

    import simplebayes
    bayes = simplebayes.SimpleBayes(cache_path='/my/cache/')
    # Cache file is '/my/cache/_simplebayes.pickle'
    # Default cache_path is '/tmp/'

    if not bayes.cache_train():
        # Unable to load cache data, so we're training it
        bayes.train('good', 'sunshine drugs love sex lobster sloth')
        bayes.train('bad', 'fear death horror government zombie')

        # Saving the cache so next time the training won't be needed
        bayes.persist_cache()

Tokenizer Override
------------------

.. code-block:: python

    import simplebayes

    def my_tokenizer(sample):
        return sample.split()

    bayes = simplebayes.SimpleBayes(tokenizer=my_tokenizer)

License
-------

::

    The MIT License (MIT)

    Copyright (c) 2015 Ryan Vennell

    Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

    The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. API Documentation ----------------- ``_simplebayes-1.5.7/docs/000077500000000000000000000000001251557323700150035ustar00rootroot00000000000000simplebayes-1.5.7/docs/Makefile000066400000000000000000000164051251557323700164510ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # User-friendly check for sphinx-build ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) $(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) endif # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext help: @echo "Please use \`make <target>' where <target> is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " applehelp to make an Apple Help Book" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" @echo " coverage to run coverage check of the documentation (if enabled)" clean: rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." 
pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/simplebayes.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/simplebayes.qhc" applehelp: $(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp @echo @echo "Build finished. The help book is in $(BUILDDIR)/applehelp." @echo "N.B. You won't be able to view it unless you put it in" \ "~/Library/Documentation/Help or install it in your application" \ "bundle." devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/simplebayes" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/simplebayes" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." 
$(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." coverage: $(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage @echo "Testing of coverage in the sources finished, look at the " \ "results in $(BUILDDIR)/coverage/python.txt." 
xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." simplebayes-1.5.7/docs/conf.py000066400000000000000000000260501251557323700163050ustar00rootroot00000000000000#!/usr/bin/env python3 # -*- coding: utf-8 -*- # # simplebayes documentation build configuration file, created by # sphinx-quickstart on Wed Apr 8 21:11:57 2015. # # This file is execfile()d with the current directory set to its # containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import sys import os import shlex # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. #sys.path.insert(0, os.path.abspath('.')) # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. #needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ 'sphinx.ext.autodoc', 'sphinx.ext.todo', 'sphinx.ext.viewcode', ] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix(es) of source filenames. # You can specify multiple suffix as a list of string: # source_suffix = ['.rst', '.md'] source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8-sig' # The master toctree document. master_doc = 'index' # General information about the project. 
project = 'simplebayes' copyright = '2015, Ryan Vennell' author = 'Ryan Vennell' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '' # The full version, including alpha/beta/rc tags. release = '' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. language = 'en' # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = ['_build'] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # If true, keep warnings as "system message" paragraphs in the built documents. #keep_warnings = False # If true, `todo` and `todoList` produce output, else they produce nothing. todo_include_todos = True # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. 
See the documentation for # a list of builtin themes. html_theme = 'sphinx_rtd_theme' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # "<project> v<release> documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. #html_extra_path = [] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_domain_indices = True # If false, no index is generated. 
#html_use_index = True # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. #html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a <link> tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = None # Language to be used for generating the HTML full-text search index. # Sphinx supports the following languages: # 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' # 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr' #html_search_language = 'en' # A dictionary with options for the search language support, empty by default. # Now only 'ja' uses this config value #html_search_options = {'type': 'default'} # The name of a javascript file (relative to the configuration directory) that # implements a search results scorer. If empty, the default will be used. #html_search_scorer = 'scorer.js' # Output file base name for HTML help builder. htmlhelp_basename = 'simplebayesdoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). #'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', # Latex figure (float) alignment #'figure_align': 'htbp', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). 
latex_documents = [ (master_doc, 'simplebayes.tex', 'simplebayes Documentation', 'Author', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # If true, show page references after internal links. #latex_show_pagerefs = False # If true, show URL addresses after external links. #latex_show_urls = False # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_domain_indices = True # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ (master_doc, 'simplebayes', 'simplebayes Documentation', [author], 1) ] # If true, show URL addresses after external links. #man_show_urls = False # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ (master_doc, 'simplebayes', 'simplebayes Documentation', author, 'simplebayes', 'One line description of project.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] # If false, no module index is generated. #texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. #texinfo_show_urls = 'footnote' # If true, do not generate a @detailmenu in the "Top" node's menu. #texinfo_no_detailmenu = False # -- Options for Epub output ---------------------------------------------- # Bibliographic Dublin Core info. epub_title = project epub_author = author epub_publisher = author epub_copyright = copyright # The basename for the epub file. It defaults to the project name. 
#epub_basename = project # The HTML theme for the epub output. Since the default themes are not optimized # for small screen space, using the same theme for HTML and epub output is # usually not wise. This defaults to 'epub', a theme designed to save visual # space. #epub_theme = 'epub' # The language of the text. It defaults to the language option # or 'en' if the language is not set. #epub_language = '' # The scheme of the identifier. Typical schemes are ISBN or URL. #epub_scheme = '' # The unique identifier of the text. This can be an ISBN number # or the project homepage. #epub_identifier = '' # A unique identification for the text. #epub_uid = '' # A tuple containing the cover image and cover page html template filenames. #epub_cover = () # A sequence of (type, uri, title) tuples for the guide element of content.opf. #epub_guide = () # HTML files that should be inserted before the pages created by sphinx. # The format is a list of tuples containing the path and title. #epub_pre_files = [] # HTML files that should be inserted after the pages created by sphinx. # The format is a list of tuples containing the path and title. #epub_post_files = [] # A list of files that should not be packed into the epub file. epub_exclude_files = ['search.html'] # The depth of the table of contents in toc.ncx. #epub_tocdepth = 3 # Allow duplicate toc entries. #epub_tocdup = True # Choose between 'default' and 'includehidden'. #epub_tocscope = 'default' # Fix unsupported image types using the Pillow. #epub_fix_images = False # Scale large images. #epub_max_image_width = 0 # How to display URL addresses: 'footnote', 'no', or 'inline'. #epub_show_urls = 'inline' # If false, no index is generated. #epub_use_index = True simplebayes-1.5.7/docs/index.rst000066400000000000000000000007051251557323700166460ustar00rootroot00000000000000.. simplebayes documentation master file, created by sphinx-quickstart on Wed Apr 8 21:11:57 2015. 
You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. Welcome to simplebayes's documentation! ======================================= Contents: .. toctree:: :maxdepth: 4 simplebayes Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` simplebayes-1.5.7/docs/make.bat000066400000000000000000000161261251557323700164160ustar00rootroot00000000000000@ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=_build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . set I18NSPHINXOPTS=%SPHINXOPTS% . if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^<target^>` where ^<target^> is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo. xml to make Docutils-native XML files echo. pseudoxml to make pseudoxml-XML files for display purposes echo. linkcheck to check all external links for integrity echo. doctest to run all doctests embedded in the documentation if enabled echo. 
coverage to run coverage check of the documentation if enabled goto end ) if "%1" == "clean" ( for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i del /q /s %BUILDDIR%\* goto end ) REM Check if sphinx-build is available and fallback to Python version if any %SPHINXBUILD% 2> nul if errorlevel 9009 goto sphinx_python goto sphinx_ok :sphinx_python set SPHINXBUILD=python -m sphinx.__init__ %SPHINXBUILD% 2> nul if errorlevel 9009 ( echo. echo.The 'sphinx-build' command was not found. Make sure you have Sphinx echo.installed, then set the SPHINXBUILD environment variable to point echo.to the full path of the 'sphinx-build' executable. Alternatively you echo.may add the Sphinx directory to PATH. echo. echo.If you don't have Sphinx installed, grab it from echo.http://sphinx-doc.org/ exit /b 1 ) :sphinx_ok if "%1" == "html" ( %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/html. goto end ) if "%1" == "dirhtml" ( %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. goto end ) if "%1" == "singlehtml" ( %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. goto end ) if "%1" == "pickle" ( %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the pickle files. goto end ) if "%1" == "json" ( %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the JSON files. goto end ) if "%1" == "htmlhelp" ( %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run HTML Help Workshop with the ^ .hhp project file in %BUILDDIR%/htmlhelp. 
goto end ) if "%1" == "qthelp" ( %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run "qcollectiongenerator" with the ^ .qhcp project file in %BUILDDIR%/qthelp, like this: echo.^> qcollectiongenerator %BUILDDIR%\qthelp\simplebayes.qhcp echo.To view the help file: echo.^> assistant -collectionFile %BUILDDIR%\qthelp\simplebayes.qhc goto end ) if "%1" == "devhelp" ( %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp if errorlevel 1 exit /b 1 echo. echo.Build finished. goto end ) if "%1" == "epub" ( %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub if errorlevel 1 exit /b 1 echo. echo.Build finished. The epub file is in %BUILDDIR%/epub. goto end ) if "%1" == "latex" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex if errorlevel 1 exit /b 1 echo. echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdf" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf cd %~dp0 echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdfja" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf-ja cd %~dp0 echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "text" ( %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text if errorlevel 1 exit /b 1 echo. echo.Build finished. The text files are in %BUILDDIR%/text. goto end ) if "%1" == "man" ( %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man if errorlevel 1 exit /b 1 echo. echo.Build finished. The manual pages are in %BUILDDIR%/man. goto end ) if "%1" == "texinfo" ( %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo if errorlevel 1 exit /b 1 echo. echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. goto end ) if "%1" == "gettext" ( %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale if errorlevel 1 exit /b 1 echo. 
echo.Build finished. The message catalogs are in %BUILDDIR%/locale. goto end ) if "%1" == "changes" ( %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes if errorlevel 1 exit /b 1 echo. echo.The overview file is in %BUILDDIR%/changes. goto end ) if "%1" == "linkcheck" ( %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck if errorlevel 1 exit /b 1 echo. echo.Link check complete; look for any errors in the above output ^ or in %BUILDDIR%/linkcheck/output.txt. goto end ) if "%1" == "doctest" ( %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest if errorlevel 1 exit /b 1 echo. echo.Testing of doctests in the sources finished, look at the ^ results in %BUILDDIR%/doctest/output.txt. goto end ) if "%1" == "coverage" ( %SPHINXBUILD% -b coverage %ALLSPHINXOPTS% %BUILDDIR%/coverage if errorlevel 1 exit /b 1 echo. echo.Testing of coverage in the sources finished, look at the ^ results in %BUILDDIR%/coverage/python.txt. goto end ) if "%1" == "xml" ( %SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml if errorlevel 1 exit /b 1 echo. echo.Build finished. The XML files are in %BUILDDIR%/xml. goto end ) if "%1" == "pseudoxml" ( %SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml if errorlevel 1 exit /b 1 echo. echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml. goto end ) :end simplebayes-1.5.7/docs/simplebayes.categories.rst000066400000000000000000000002351251557323700221760ustar00rootroot00000000000000simplebayes.categories module ============================= .. automodule:: simplebayes.categories :members: :undoc-members: :show-inheritance: simplebayes-1.5.7/docs/simplebayes.category.rst000066400000000000000000000002271251557323700216670ustar00rootroot00000000000000simplebayes.category module =========================== .. 
automodule:: simplebayes.category :members: :undoc-members: :show-inheritance: simplebayes-1.5.7/docs/simplebayes.rst000066400000000000000000000003671251557323700200600ustar00rootroot00000000000000simplebayes package =================== Submodules ---------- .. toctree:: simplebayes.categories simplebayes.category Module contents --------------- .. automodule:: simplebayes :members: :undoc-members: :show-inheritance: simplebayes-1.5.7/pylintrc000066400000000000000000000232501251557323700156440ustar00rootroot00000000000000[MASTER] # Specify a configuration file. #rcfile= # Python code to execute, usually for sys.path manipulation such as # pygtk.require(). #init-hook= # Profiled execution. profile=no # Add files or directories to the blacklist. They should be base names, not # paths. ignore=CVS # Pickle collected data for later comparisons. persistent=no # List of plugins (as comma separated values of python modules names) to load, # usually to register additional checkers. load-plugins= [MESSAGES CONTROL] # Enable the message, report, category or checker with the given id(s). You can # either give multiple identifier separated by comma (,) or put this option # multiple time. See also the "--disable" option for examples. #enable= # Disable the message, report, category or checker with the given id(s). You # can either give multiple identifiers separated by comma (,) or put this # option multiple times (only on the command line, not in the configuration # file where it should appear only once).You can also use "--disable=all" to # disable everything first and then reenable specific checks. For example, if # you want to run only the similarities checker, you can use "--disable=all # --enable=similarities". If you want to run only the classes checker, but have # no Warning level messages displayed, use"--disable=all --enable=classes # --disable=W" # disable=W0232,R0903,I0011,W0703 disable=locally-disabled [REPORTS] # Set the output format. 
Available formats are text, parseable, colorized, msvs # (visual studio) and html. You can also give a reporter class, eg # mypackage.mymodule.MyReporterClass. output-format=text # Put messages in a separate file for each module / package specified on the # command line instead of printing them on stdout. Reports (if any) will be # written in a file name "pylint_global.[txt|html]". files-output=no # Tells whether to display a full report or only the messages reports=yes # Python expression which should return a note less than 10 (10 is the highest # note). You have access to the variables errors warning, statement which # respectively contain the number of errors / warnings messages and the total # number of statements analyzed. This is used by the global evaluation report # (RP0004). evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10) # Add a comment according to your evaluation note. This is used by the global # evaluation report (RP0004). comment=no # Template used to display messages. This is a python new-style format string # used to format the message information. See doc for all details #msg-template= [SIMILARITIES] # Minimum lines number of a similarity. min-similarity-lines=4 # Ignore comments when computing similarities. ignore-comments=yes # Ignore docstrings when computing similarities. ignore-docstrings=yes # Ignore imports when computing similarities. ignore-imports=no [TYPECHECK] # Tells whether missing members accessed in mixin class should be ignored. A # mixin class is detected if its name ends with "mixin" (case insensitive). 
ignore-mixin-members=yes # List of module names for which member attributes should not be checked # (useful for modules/projects where namespaces are manipulated during runtime # and thus extisting member attributes cannot be deduced by static analysis ignored-modules= # List of classes names for which member attributes should not be checked # (useful for classes with attributes dynamically set). ignored-classes=SQLObject # When zope mode is activated, add a predefined set of Zope acquired attributes # to generated-members. zope=no # List of members which are set dynamically and missed by pylint inference # system, and so shouldn't trigger E0201 when accessed. Python regular # expressions are accepted. generated-members=REQUEST,acl_users,aq_parent [FORMAT] # Maximum number of characters on a single line. max-line-length=80 # Regexp for a line that is allowed to be longer than the limit. ignore-long-lines=^\s*(# )??$ # Allow the body of an if to be on the same line as the test if there is no # else. single-line-if-stmt=no # List of optional constructs for which whitespace checking is disabled no-space-check=trailing-comma,dict-separator # Maximum number of lines in a module max-module-lines=1000 # String used as indentation unit. This is usually " " (4 spaces) or "\t" (1 # tab). indent-string=' ' [VARIABLES] # Tells whether we should check for unused import in __init__ files. init-import=no # A regular expression matching the name of dummy variables (i.e. expectedly # not used). dummy-variables-rgx=_|dummy # List of additional names supposed to be defined in builtins. Remember that # you should avoid to define new builtins when possible. additional-builtins= [LOGGING] # Logging modules to check that the string format arguments are in logging # function parameter format logging-modules=logging [MISCELLANEOUS] # List of note tags to take in consideration, separated by a comma. 
notes=FIXME,XXX,TODO [BASIC] # Required attributes for module, separated by a comma required-attributes= # List of builtins function names that should not be used, separated by a comma bad-functions=map,filter,apply,input # Good variable names which should always be accepted, separated by a comma good-names=i,j,k,ex,Run,_ # Bad variable names which should always be refused, separated by a comma bad-names=foo,bar,baz,toto,tutu,tata # Colon-delimited sets of names that determine each other's naming style when # the name regexes allow several styles. name-group= # Include a hint for the correct naming format with invalid-name include-naming-hint=no # Regular expression matching correct function names function-rgx=[a-z_][a-z0-9_]{2,30}$ # Naming hint for function names function-name-hint=[a-z_][a-z0-9_]{2,30}$ # Regular expression matching correct variable names variable-rgx=[a-z_][a-z0-9_]{2,30}$ # Naming hint for variable names variable-name-hint=[a-z_][a-z0-9_]{2,30}$ # Regular expression matching correct constant names const-rgx=(([A-Z_][A-Z0-9_]*)|(__.*__))$ # Naming hint for constant names const-name-hint=(([A-Z_][A-Z0-9_]*)|(__.*__))$ # Regular expression matching correct attribute names attr-rgx=[a-z_][a-z0-9_]{2,30}$ # Naming hint for attribute names attr-name-hint=[a-z_][a-z0-9_]{2,30}$ # Regular expression matching correct argument names argument-rgx=[a-z_][a-z0-9_]{2,30}$ # Naming hint for argument names argument-name-hint=[a-z_][a-z0-9_]{2,30}$ # Regular expression matching correct class attribute names class-attribute-rgx=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$ # Naming hint for class attribute names class-attribute-name-hint=([A-Za-z_][A-Za-z0-9_]{2,30}|(__.*__))$ # Regular expression matching correct inline iteration names inlinevar-rgx=[A-Za-z_][A-Za-z0-9_]*$ # Naming hint for inline iteration names inlinevar-name-hint=[A-Za-z_][A-Za-z0-9_]*$ # Regular expression matching correct class names class-rgx=[A-Z_][a-zA-Z0-9]+$ # Naming hint for class names 
class-name-hint=[A-Z_][a-zA-Z0-9]+$ # Regular expression matching correct module names module-rgx=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$ # Naming hint for module names module-name-hint=(([a-z_][a-z0-9_]*)|([A-Z][a-zA-Z0-9]+))$ # Regular expression matching correct method names method-rgx=[a-z_][a-z0-9_]{2,30}$ # Naming hint for method names method-name-hint=[a-z_][a-z0-9_]{2,30}$ # Regular expression which should only match function or class names that do # not require a docstring. no-docstring-rgx=__.*__ # Minimum line length for functions/classes that require docstrings, shorter # ones are exempt. docstring-min-length=-1 [DESIGN] # Maximum number of arguments for function / method max-args=5 # Argument names that match this expression will be ignored. Default to name # with leading underscore ignored-argument-names=_.* # Maximum number of locals for function / method body max-locals=15 # Maximum number of return / yield for function / method body max-returns=6 # Maximum number of branch for function / method body max-branches=12 # Maximum number of statements in function / method body max-statements=50 # Maximum number of parents for a class (see R0901). max-parents=7 # Maximum number of attributes for a class (see R0902). max-attributes=7 # Minimum number of public methods for a class (see R0903). min-public-methods=2 # Maximum number of public methods for a class (see R0904). max-public-methods=20 [IMPORTS] # Deprecated modules which should not be used, separated by a comma deprecated-modules=regsub,TERMIOS,Bastion,rexec # Create a graph of every (i.e. internal and external) dependencies in the # given file (report RP0402 must not be disabled) import-graph= # Create a graph of external dependencies in the given file (report RP0402 must # not be disabled) ext-import-graph= # Create a graph of internal dependencies in the given file (report RP0402 must # not be disabled) int-import-graph= [CLASSES] # List of interface methods to ignore, separated by a comma. 
This is used for # instance to not check methods defines in Zope's Interface base class. ignore-iface-methods=isImplementedBy,deferred,extends,names,namesAndDescriptions,queryDescriptionFor,getBases,getDescriptionFor,getDoc,getName,getTaggedValue,getTaggedValueTags,isEqualOrExtendedBy,setTaggedValue,isImplementedByInstancesOf,adaptWith,is_implemented_by # List of method names used to declare (i.e. assign) instance attributes. defining-attr-methods=__init__,__new__,setUp # List of valid names for the first argument in a class method. valid-classmethod-first-arg=cls # List of valid names for the first argument in a metaclass class method. valid-metaclass-classmethod-first-arg=mcs [EXCEPTIONS] # Exceptions that will emit a warning when being caught. Defaults to # "Exception" overgeneral-exceptions=Exception simplebayes-1.5.7/setup.py000066400000000000000000000015471251557323700155740ustar00rootroot00000000000000# coding: utf-8 from setuptools import setup setup ( name = 'simplebayes', version = '1.5.8', url = 'https://github.com/hickeroar/simplebayes', author = 'Ryan Vennell', author_email = 'ryan.vennell@gmail.com', description = 'A memory-based, optional-persistence naïve bayesian text classifier.', long_description = open('README.rst', 'r').read(), license = 'MIT', classifiers = [ 'Development Status :: 5 - Production/Stable', 'Intended Audience :: Developers', 'License :: OSI Approved :: MIT License', 'Programming Language :: Python', 'Programming Language :: Python :: 2', 'Programming Language :: Python :: 2.7', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.4', 'Topic :: Utilities', ], packages = ['simplebayes'], ) simplebayes-1.5.7/setup/000077500000000000000000000000001251557323700152135ustar00rootroot00000000000000simplebayes-1.5.7/setup/distribute.sh000077500000000000000000000002561251557323700177330ustar00rootroot00000000000000#!/usr/bin/env bash python3 ./setup.py sdist upload python3 ./setup.py bdist_egg upload python 
./setup.py bdist_egg upload python3 ./setup.py bdist_wheel --universal upload simplebayes-1.5.7/setup/requirements.dev.txt000066400000000000000000000000411251557323700212470ustar00rootroot00000000000000nose coveralls flake8 mock pylintsimplebayes-1.5.7/simplebayes/000077500000000000000000000000001251557323700163705ustar00rootroot00000000000000simplebayes-1.5.7/simplebayes/__init__.py000066400000000000000000000240511251557323700205030ustar00rootroot00000000000000# coding: utf-8 """ The MIT License (MIT) Copyright (c) 2015 Ryan Vennell Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" from simplebayes.categories import BayesCategories import pickle import os class SimpleBayes(object): """A memory-based, optional-persistence naïve bayesian text classifier.""" cache_file = '_simplebayes.pickle' def __init__(self, tokenizer=None, cache_path='/tmp/'): """ :param tokenizer: A tokenizer override :type tokenizer: function (optional) :param cache_path: path to data storage :type cache_path: str """ self.categories = BayesCategories() self.tokenizer = tokenizer or SimpleBayes.tokenize_text self.cache_path = cache_path self.probabilities = {} @classmethod def tokenize_text(cls, text): """ Default tokenize method; can be overridden :param text: the text we want to tokenize :type text: str :return: list of tokenized text :rtype: list """ return [w for w in text.split() if len(w) > 2] @classmethod def count_token_occurrences(cls, words): """ Creates a key/value set of word/count for a given sample of text :param words: full list of all tokens, non-unique :type words: list :return: key/value pairs of words and their counts in the list :rtype: dict """ counts = {} for word in words: if word in counts: counts[word] += 1 else: counts[word] = 1 return counts def flush(self): """ Deletes all tokens & categories """ self.categories = BayesCategories() def calculate_category_probability(self): """ Caches the individual probabilities for each category """ total_tally = 0.0 probs = {} for category, bayes_category in \ self.categories.get_categories().items(): count = bayes_category.get_tally() total_tally += count probs[category] = count # Calculating the probability for category, count in probs.items(): if total_tally > 0: probs[category] = float(count)/float(total_tally) else: probs[category] = 0.0 for category, probability in probs.items(): self.probabilities[category] = { # Probability that any given token is of this category 'prc': probability, # Probability that any given token is not of this category 'prnc': sum(probs.values()) - probability } def 
train(self, category, text): """ Trains a category with a sample of text :param category: the name of the category we want to train :type category: str :param text: the text we want to train the category with :type text: str """ try: bayes_category = self.categories.get_category(category) except KeyError: bayes_category = self.categories.add_category(category) tokens = self.tokenizer(str(text)) occurrence_counts = self.count_token_occurrences(tokens) for word, count in occurrence_counts.items(): bayes_category.train_token(word, count) # Updating our per-category overall probabilities self.calculate_category_probability() def untrain(self, category, text): """ Untrains a category with a sample of text :param category: the name of the category we want to untrain :type category: str :param text: the text we want to untrain the category with :type text: str """ try: bayes_category = self.categories.get_category(category) except KeyError: return tokens = self.tokenizer(str(text)) occurrence_counts = self.count_token_occurrences(tokens) for word, count in occurrence_counts.items(): bayes_category.untrain_token(word, count) # Updating our per-category overall probabilities self.calculate_category_probability() def classify(self, text): """ Chooses the highest scoring category for a sample of text :param text: sample text to classify :type text: str :return: the "winning" category :rtype: str """ score = self.score(text) if not score: return None return sorted(score.items(), key=lambda v: v[1])[-1][0] def score(self, text): """ Scores a sample of text :param text: sample text to score :type text: str :return: dict of scores per category :rtype: dict """ occurs = self.count_token_occurrences(self.tokenizer(text)) scores = {} for category in self.categories.get_categories().keys(): scores[category] = 0 categories = self.categories.get_categories().items() for word, count in occurs.items(): token_scores = {} # Adding up individual token scores for category, bayes_category in 
categories: token_scores[category] = \ float(bayes_category.get_token_count(word)) # We use this to get token-in-category probabilities token_tally = sum(token_scores.values()) # If this token isn't found anywhere its probability is 0 if token_tally == 0.0: continue # Calculating bayes probability for this token # http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering for category, token_score in token_scores.items(): # Bayes probability * the number of occurrences of this token scores[category] += count * \ self.calculate_bayesian_probability( category, token_score, token_tally ) # Removing empty categories from the results final_scores = {} for category, score in scores.items(): if score > 0: final_scores[category] = score return final_scores def calculate_bayesian_probability(self, cat, token_score, token_tally): """ Calculates the bayesian probability for a given token/category :param cat: The category we're scoring for this token :type cat: str :param token_score: The tally of this token for this category :type token_score: float :param token_tally: The tally total for this token from all categories :type token_tally: float :return: bayesian probability :rtype: float """ # P that any given token IS in this category prc = self.probabilities[cat]['prc'] # P that any given token is NOT in this category prnc = self.probabilities[cat]['prnc'] # P that this token is NOT of this category prtnc = (token_tally - token_score) / token_tally # P that this token IS of this category prtc = token_score / token_tally # Assembling the parts of the bayes equation numerator = (prtc * prc) denominator = (numerator + (prtnc * prnc)) # Returning the calculated bayes probability unless the denom. 
is 0 return numerator / denominator if denominator != 0.0 else 0.0 def tally(self, category): """ Gets the tally for a requested category :param category: The category we want a tally for :type category: str :return: tally for a given category :rtype: int """ try: bayes_category = self.categories.get_category(category) except KeyError: return 0 return bayes_category.get_tally() def get_cache_location(self): """ Gets the location of the cache file :return: the location of the cache file :rtype: string """ filename = self.cache_path if \ self.cache_path[-1:] == '/' else \ self.cache_path + '/' filename += self.cache_file return filename def cache_persist(self): """ Saves the current trained data to the cache. This is initiated by the program using this module """ filename = self.get_cache_location() pickle.dump(self.categories, open(filename, 'wb')) def cache_train(self): """ Loads the data for this classifier from a cache file :return: whether or not we were successful :rtype: bool """ filename = self.get_cache_location() if not os.path.exists(filename): return False categories = pickle.load(open(filename, 'rb')) assert isinstance(categories, BayesCategories), \ "Cache data is either corrupt or invalid" self.categories = categories # Updating our per-category overall probabilities self.calculate_category_probability() return True simplebayes-1.5.7/simplebayes/categories.py000066400000000000000000000040661251557323700210750ustar00rootroot00000000000000""" The MIT License (MIT) Copyright (c) 2015 Ryan Vennell Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and 
this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ from simplebayes.category import BayesCategory class BayesCategories(object): """Acts as a container for various bayes trained categories of content""" def __init__(self): self.categories = {} def add_category(self, name): """ Adds a bayes category that we can later train :param name: name of the category :type name: str :return: the requested category :rtype: BayesCategory """ category = BayesCategory(name) self.categories[name] = category return category def get_category(self, name): """ Returns the expected category. 
Raises KeyError if the category does not exist :param name: name of the category :type name: str :return: the requested category :rtype: BayesCategory """ return self.categories[name] def get_categories(self): """ :return: dict of all categories :rtype: dict """ return self.categories simplebayes-1.5.7/simplebayes/category.py000066400000000000000000000055071251557323700205660ustar00rootroot00000000000000""" The MIT License (MIT) Copyright (c) 2015 Ryan Vennell Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
""" class BayesCategory(object): """ Represents a trainable category of content for bayesian classification """ def __init__(self, name): """ :param name: The name of the category we're creating :type name: str """ self.name = name self.tokens = {} self.tally = 0 def train_token(self, word, count): """ Trains a particular token (increases the weight/count of it) :param word: the token we're going to train :type word: str :param count: the number of occurrences in the sample :type count: int """ if word not in self.tokens: self.tokens[word] = 0 self.tokens[word] += count self.tally += count def untrain_token(self, word, count): """ Untrains a particular token (decreases the weight/count of it) :param word: the token we're going to untrain :type word: str :param count: the number of occurrences in the sample :type count: int """ if word not in self.tokens: return # If we're trying to untrain more tokens than we have, we end at 0 count = min(count, self.tokens[word]) self.tokens[word] -= count self.tally -= count def get_token_count(self, word): """ Gets the count associated with a provided token/word :param word: the token we're getting the weight of :type word: str :return: the weight/count of the token :rtype: int """ return self.tokens.get(word, 0) def get_tally(self): """ Gets the total tally of all tokens in this category :return: The total number of tokens :rtype: int """ return self.tally simplebayes-1.5.7/tests/000077500000000000000000000000001251557323700152155ustar00rootroot00000000000000simplebayes-1.5.7/tests/__init__.py000066400000000000000000000234671251557323700173420ustar00rootroot00000000000000""" The MIT License (MIT) Copyright (c) 2015 Ryan Vennell Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, 
and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. """ # pylint: disable=invalid-name,missing-docstring,no-self-use from simplebayes import SimpleBayes from simplebayes.categories import BayesCategories import unittest import mock import pickle import os try: import __builtin__ as builtins except ImportError: # pylint: disable=import-error import builtins class SimpleBayesTests(unittest.TestCase): def test_tokenizer(self): sb = SimpleBayes() result = sb.tokenizer('hello world') self.assertEqual(result, ['hello', 'world']) def test_count_token_occurrences(self): sb = SimpleBayes() result = sb.count_token_occurrences(['hello', 'world', 'hello']) self.assertEqual( result, { 'hello': 2, 'world': 1 } ) def test_flush_and_tally(self): sb = SimpleBayes() sb.train('foo', 'hello world hello') self.assertEqual(sb.tally('foo'), 3) sb.flush() self.assertEqual(sb.tally('foo'), 0) def test_untrain(self): sb = SimpleBayes() sb.train('foo', 'hello world hello') self.assertEqual(sb.tally('foo'), 3) self.assertEqual(sb.tally('bar'), 0) sb.untrain('bar', 'for bar baz') self.assertEqual(sb.tally('foo'), 3) self.assertEqual(sb.tally('bar'), 0) sb.untrain('foo', 'hello world') self.assertEqual(sb.tally('foo'), 1) @mock.patch.object(BayesCategories, 'get_category') # pylint: disable=no-self-use def test_train_with_existing_category(self, get_category_mock): 
cat_mock = mock.MagicMock() cat_mock.train_token.return_value = None get_category_mock.return_value = cat_mock sb = SimpleBayes() sb.train('foo', 'hello world hello') get_category_mock.assert_called_once_with('foo') cat_mock.train_token.assert_any_call('hello', 2) cat_mock.train_token.assert_any_call('world', 1) @mock.patch.object(BayesCategories, 'get_category') @mock.patch.object(BayesCategories, 'add_category') # pylint: disable=no-self-use def test_train_with_new_category( self, add_category_mock, get_category_mock ): cat_mock = mock.MagicMock() cat_mock.train_token.return_value = None get_category_mock.side_effect = KeyError() add_category_mock.return_value = cat_mock sb = SimpleBayes() sb.train('foo', 'hello world hello') add_category_mock.assert_called_with('foo') cat_mock.train_token.assert_any_call('hello', 2) cat_mock.train_token.assert_any_call('world', 1) @mock.patch.object(BayesCategories, 'get_categories') def test_classify(self, get_categories_mock): cat1_mock = mock.MagicMock() cat1_mock.get_token_count.return_value = 2 cat1_mock.get_tally.return_value = 8 cat2_mock = mock.MagicMock() cat2_mock.get_token_count.return_value = 4 cat2_mock.get_tally.return_value = 32 get_categories_mock.return_value = { 'foo': cat1_mock, 'bar': cat2_mock } sb = SimpleBayes() sb.calculate_category_probability() result = sb.classify('hello world') self.assertEqual('bar', result) assert 3 == get_categories_mock.call_count, \ get_categories_mock.call_count cat1_mock.get_token_count.assert_any_call('hello') cat1_mock.get_token_count.assert_any_call('world') cat1_mock.get_tally.assert_called_once_with() cat2_mock.get_token_count.assert_any_call('hello') cat2_mock.get_token_count.assert_any_call('world') cat2_mock.get_tally.assert_called_once_with() @mock.patch.object(BayesCategories, 'get_categories') def test_classify_without_categories(self, get_categories_mock): get_categories_mock.return_value = {} sb = SimpleBayes() result = sb.classify('hello world') 
self.assertIsNone(result) assert 2 == get_categories_mock.call_count, \ get_categories_mock.call_count @mock.patch.object(BayesCategories, 'get_categories') def test_classify_with_empty_category(self, get_categories_mock): cat_mock = mock.MagicMock() cat_mock.get_tally.return_value = 0 cat_mock.get_token_count.return_value = 0 get_categories_mock.return_value = { 'foo': cat_mock } sb = SimpleBayes() sb.calculate_category_probability() result = sb.classify('hello world') self.assertIsNone(result) assert 3 == get_categories_mock.call_count, \ get_categories_mock.call_count cat_mock.get_tally.assert_called_once_with() @mock.patch.object(BayesCategories, 'get_categories') def test_score(self, get_categories_mock): cat1_mock = mock.MagicMock() cat1_mock.get_token_count.return_value = 2 cat1_mock.get_tally.return_value = 8 cat2_mock = mock.MagicMock() cat2_mock.get_token_count.return_value = 4 cat2_mock.get_tally.return_value = 32 get_categories_mock.return_value = { 'foo': cat1_mock, 'bar': cat2_mock } sb = SimpleBayes() sb.calculate_category_probability() result = sb.score('hello world') self.assertEqual( { 'foo': 0.22222222222222224, 'bar': 1.777777777777778 }, result ) assert 3 == get_categories_mock.call_count, \ get_categories_mock.call_count cat1_mock.get_token_count.assert_any_call('hello') cat1_mock.get_token_count.assert_any_call('world') cat1_mock.get_tally.assert_called_once_with() cat2_mock.get_token_count.assert_any_call('hello') cat2_mock.get_token_count.assert_any_call('world') cat2_mock.get_tally.assert_called_once_with() @mock.patch.object(BayesCategories, 'get_categories') def test_score_with_zero_bayes_denon(self, get_categories_mock): cat1_mock = mock.MagicMock() cat1_mock.get_token_count.return_value = 2 cat1_mock.get_tally.return_value = 8 cat2_mock = mock.MagicMock() cat2_mock.get_token_count.return_value = 4 cat2_mock.get_tally.return_value = 32 get_categories_mock.return_value = { 'foo': cat1_mock, 'bar': cat2_mock } sb = SimpleBayes() 
sb.calculate_category_probability() sb.probabilities['foo']['prc'] = 0 sb.probabilities['foo']['prnc'] = 0 result = sb.score('hello world') self.assertEqual( { 'bar': 1.777777777777778 }, result ) assert 3 == get_categories_mock.call_count, \ get_categories_mock.call_count cat1_mock.get_token_count.assert_any_call('hello') cat1_mock.get_token_count.assert_any_call('world') cat1_mock.get_tally.assert_called_once_with() cat2_mock.get_token_count.assert_any_call('hello') cat2_mock.get_token_count.assert_any_call('world') cat2_mock.get_tally.assert_called_once_with() @mock.patch.object(SimpleBayes, 'calculate_category_probability') @mock.patch.object(builtins, 'open') @mock.patch.object(pickle, 'load') @mock.patch.object(os.path, 'exists') def test_cache_train(self, exists_mock, load_mock, open_mock, calc_mock): categories = BayesCategories() categories.categories = {'foo': 'bar'} load_mock.return_value = categories open_mock.return_value = 'opened' exists_mock.return_value = True sb = SimpleBayes(cache_path='foo') sb.cache_train() exists_mock.assert_called_once_with('foo/_simplebayes.pickle') open_mock.assert_called_once_with('foo/_simplebayes.pickle', 'rb') load_mock.assert_called_once_with('opened') calc_mock.assert_called_once_with() self.assertEqual(sb.categories, categories) @mock.patch.object(os.path, 'exists') def test_cache_train_with_no_file(self, exists_mock): exists_mock.return_value = False sb = SimpleBayes() result = sb.cache_train() exists_mock.assert_called_once_with('/tmp/_simplebayes.pickle') self.assertFalse(result) @mock.patch.object(builtins, 'open') @mock.patch.object(pickle, 'dump') def test_persist_cache(self, dump_mock, open_mock): open_mock.return_value = 'opened' categories = BayesCategories() categories.categories = {'foo': 'bar'} sb = SimpleBayes() sb.cache_path = '/tmp/' sb.categories = categories sb.cache_persist() open_mock.assert_called_once_with('/tmp/_simplebayes.pickle', 'wb') dump_mock.assert_called_once_with(categories, 'opened') 
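The expected values in test_score (0.2222... for 'foo', 1.7777... for 'bar') can be reproduced by hand. The following is a minimal standalone sketch of the same arithmetic, assuming plain dicts in place of the BayesCategory objects; the helper names category_priors and score_tokens, and the token_counts and tallies inputs, are hypothetical and not part of the simplebayes API.

```python
# Standalone sketch of the scoring arithmetic exercised by test_score.
# category_priors and score_tokens are illustrative names only; the library
# keeps this logic inside the SimpleBayes class.


def category_priors(tallies):
    """Map each category to its 'prc' and 'prnc' values from token tallies."""
    total = float(sum(tallies.values()))
    probs = {}
    for category, tally in tallies.items():
        probs[category] = tally / total if total > 0 else 0.0
    all_probs = sum(probs.values())
    priors = {}
    for category, prob in probs.items():
        priors[category] = {'prc': prob, 'prnc': all_probs - prob}
    return priors


def score_tokens(token_counts, tallies, sample_tokens):
    """Sum the per-token Bayes probability of each category over a sample."""
    priors = category_priors(tallies)
    scores = dict.fromkeys(tallies, 0.0)
    # Visiting every token once per occurrence is equivalent to multiplying
    # the per-token probability by its occurrence count in the sample.
    for token in sample_tokens:
        per_cat = {}
        for category in tallies:
            per_cat[category] = float(token_counts[category].get(token, 0))
        token_tally = sum(per_cat.values())
        if token_tally == 0.0:
            continue  # token never seen in training
        for category, token_score in per_cat.items():
            prtc = token_score / token_tally
            prtnc = (token_tally - token_score) / token_tally
            numerator = prtc * priors[category]['prc']
            denominator = numerator + prtnc * priors[category]['prnc']
            if denominator != 0.0:
                scores[category] += numerator / denominator
    # Categories with a zero score are dropped from the result
    return dict((c, s) for c, s in scores.items() if s > 0)


# Same figures as the mocks in test_score: tallies of 8 and 32, and every
# token counted 2 times in 'foo' and 4 times in 'bar'.
tallies = {'foo': 8, 'bar': 32}
token_counts = {
    'foo': {'hello': 2, 'world': 2},
    'bar': {'hello': 4, 'world': 4},
}
result = score_tokens(token_counts, tallies, ['hello', 'world'])
```

With these inputs each token contributes 1/9 to 'foo' and 8/9 to 'bar', so two tokens give roughly 2/9 and 16/9, which is why 'bar' wins in test_classify.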
simplebayes-1.5.7/tests/build.sh000077500000000000000000000017431251557323700166600ustar00rootroot00000000000000
#!/bin/bash

echo
echo
echo " [simplebayes] Step 1: Executing Unit Tests (Python 2)"
echo
nosetests tests/test.py --with-coverage --cover-package=simplebayes --cover-min-percentage 100
# Report the exit code before cleanup; after rm, $? would hold rm's status.
echo -e "\nExit Code:" $?
rm -f .coverage*
echo
echo " [simplebayes] Step 1: Executing Unit Tests (Python 3)"
echo
nosetests3 tests/test.py --with-coverage --cover-package=simplebayes --cover-min-percentage 100
echo -e "\nExit Code:" $?
rm -f .coverage*
echo
echo " [simplebayes] Step 2: Executing pep8 and pyflakes Tests (flake8). (Python 2)"
echo
flake8 simplebayes tests
echo "Exit Code:" $?
echo
echo " [simplebayes] Step 2: Executing pep8 and pyflakes Tests (flake8). (Python 3)"
echo
flake83 simplebayes tests
echo "Exit Code:" $?
echo
echo " [simplebayes] Step 3: Executing pylint Tests (Python 2)"
echo
pylint simplebayes tests --reports=no
echo "Exit Code:" $?
echo
echo " [simplebayes] Step 3: Executing pylint Tests (Python 3)"
echo
pylint3 simplebayes tests --reports=no
echo "Exit Code:" $?
echo
simplebayes-1.5.7/tests/categories.py000066400000000000000000000034621251557323700177210ustar00rootroot00000000000000
"""
The MIT License (MIT)

Copyright (c) 2015 Ryan Vennell

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""
# pylint: disable=invalid-name,missing-docstring,no-self-use
from simplebayes.categories import BayesCategories
from simplebayes.category import BayesCategory
import unittest


class BayesCategoriesTests(unittest.TestCase):

    def test_add_category(self):
        bc = BayesCategories()
        bc.add_category('foo')
        self.assertIn('foo', bc.categories)
        self.assertIsInstance(bc.categories['foo'], BayesCategory)

    def test_get_category(self):
        bc = BayesCategories()
        bc.add_category('foo')
        self.assertIsInstance(bc.get_category('foo'), BayesCategory)

    def test_get_categories(self):
        bc = BayesCategories()
        bc.add_category('foo')
        self.assertEqual(bc.get_categories(), bc.categories)
simplebayes-1.5.7/tests/category.py000066400000000000000000000046261251557323700174140ustar00rootroot00000000000000
"""
The MIT License (MIT)

Copyright (c) 2015 Ryan Vennell

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""
# pylint: disable=invalid-name,missing-docstring,no-self-use
from simplebayes.category import BayesCategory
import unittest


class BayesCategoryTests(unittest.TestCase):

    def test_train_token(self):
        bc = BayesCategory('foo')
        bc.train_token('foo', 5)
        bc.train_token('bar', 7)
        self.assertEqual(12, bc.tally)
        self.assertIn('foo', bc.tokens)
        self.assertEqual(bc.tokens['foo'], 5)

    def test_untrain_token(self):
        bc = BayesCategory('foo')
        bc.train_token('foo', 5)
        bc.train_token('bar', 7)
        self.assertEqual(12, bc.tally)
        self.assertIn('foo', bc.tokens)
        self.assertIn('bar', bc.tokens)
        self.assertEqual(bc.tokens['foo'], 5)
        self.assertEqual(bc.tokens['bar'], 7)

        bc.untrain_token('foo', 3)
        bc.untrain_token('bar', 20)
        bc.untrain_token('baz', 5)
        self.assertEqual(2, bc.tally)
        self.assertEqual(bc.tokens['foo'], 2)
        self.assertEqual(bc.tokens['bar'], 0)

    def test_get_token_count(self):
        bc = BayesCategory('foo')
        bc.train_token('foo', 5)
        self.assertEqual(bc.get_token_count('foo'), 5)
        self.assertEqual(bc.get_token_count('bar'), 0)

    def test_get_tally(self):
        bc = BayesCategory('foo')
        bc.train_token('foo', 5)
        self.assertEqual(5, bc.get_tally())
simplebayes-1.5.7/tests/test.py000066400000000000000000000026301251557323700165470ustar00rootroot00000000000000
#!/usr/bin/python
"""
The MIT License (MIT)

Copyright (c) 2015 Ryan Vennell

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without
limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
"""
# pylint: disable=unused-wildcard-import,wildcard-import,unused-import
import simplebayes  # flake8: noqa
import simplebayes.categories  # flake8: noqa
import simplebayes.category  # flake8: noqa
from tests import *  # flake8: noqa
from tests.categories import *  # flake8: noqa
from tests.category import *  # flake8: noqa