pax_global_header00006660000000000000000000000064136232706240014517gustar00rootroot0000000000000052 comment=0d88dd83e425c4ba381d5b83fe61bfae5bbbd627 bleach-3.1.1/000077500000000000000000000000001362327062400127375ustar00rootroot00000000000000bleach-3.1.1/.gitignore000066400000000000000000000002251362327062400147260ustar00rootroot00000000000000*.pyo *.pyc pip-log.txt .coverage dist *.egg-info build .tox docs/_build/ .cache/ .eggs/ .*env*/ .pytest_cache/ .python-version *~ *.swp __pycache__ bleach-3.1.1/.travis.yml000066400000000000000000000010271362327062400150500ustar00rootroot00000000000000# Note: If you update this, make sure to update tox.ini, too. sudo: false language: python cache: directories: - "~/.cache/pip" python: - "2.7" - "3.4" - "3.5" - "3.6" - "pypy" install: - pip install -U pip setuptools>=18.5 - pip install -r requirements-dev.txt matrix: include: - python: "2.7" env: MODE=lint - python: "2.7" env: MODE=vendorverify - python: "3.4" env: MODE=lint - python: "3.7" sudo: required dist: xenial script: - ./scripts/run_tests.sh $MODE bleach-3.1.1/CHANGES000066400000000000000000000347171362327062400137460ustar00rootroot00000000000000Bleach changes ============== Version 3.1.1 (February 13th, 2020) ----------------------------------- **Security fixes** * ``bleach.clean`` behavior parsing ``noscript`` tags did not match browser behavior. Calls to ``bleach.clean`` allowing ``noscript`` and one or more of the raw text tags (``title``, ``textarea``, ``script``, ``style``, ``noembed``, ``noframes``, ``iframe``, and ``xmp``) were vulnerable to a mutation XSS. This security issue was confirmed in Bleach versions v2.1.4, v3.0.2, and v3.1.0. Earlier versions are probably affected too. Anyone using Bleach <=v3.1.0 is highly encouraged to upgrade. 
https://bugzilla.mozilla.org/show_bug.cgi?id=1615315 **Backwards incompatible changes** None **Features** None **Bug fixes** None Bleach changes ============== Version 3.1.0 (January 9th, 2019) --------------------------------- **Security fixes** None **Backwards incompatible changes** None **Features** * Add ``recognized_tags`` argument to the linkify ``Linker`` class. This fixes issues when linkifying on its own and having some tags get escaped. It defaults to a list of HTML5 tags. Thank you, Chad Birch! (#409) **Bug fixes** * Add ``six>=1.9`` to requirements. Thank you, Dave Shawley (#416) * Fix cases where attribute names could have invalid characters in them. (#419) * Fix problems with ``LinkifyFilter`` not being able to match links across ``&``. (#422) * Fix ``InputStreamWithMemory`` when the ``BleachHTMLParser`` is parsing ``meta`` tags. (#431) * Fix doctests. (#357) Version 3.0.2 (October 11th, 2018) ---------------------------------- **Security fixes** None **Backwards incompatible changes** None **Features** None **Bug fixes** * Merge ``Characters`` tokens after sanitizing them. This fixes issues in the ``LinkifyFilter`` where it was only linkifying parts of urls. (#374) Version 3.0.1 (October 9th, 2018) --------------------------------- **Security fixes** None **Backwards incompatible changes** None **Features** * Support Python 3.7. It supported Python 3.7 just fine, but we added 3.7 to the list of Python environments we test so this is now officially supported. (#377) **Bug fixes** * Fix ``list`` object has no attribute ``lower`` in ``clean``. (#398) * Fix ``abbr`` getting escaped in ``linkify``. (#400) Version 3.0.0 (October 3rd, 2018) --------------------------------- **Security fixes** None **Backwards incompatible changes** * A bunch of functions were moved from one module to another. 
These were moved from ``bleach.sanitizer`` to ``bleach.html5lib_shim``: * ``convert_entity`` * ``convert_entities`` * ``match_entity`` * ``next_possible_entity`` * ``BleachHTMLSerializer`` * ``BleachHTMLTokenizer`` * ``BleachHTMLParser`` These functions and classes weren't documented and aren't part of the public API, but people read code and might be using them so we're considering it an incompatible API change. If you're using them, you'll need to update your code. **Features** * Bleach no longer depends on html5lib. html5lib==1.0.1 is now vendored into Bleach. You can remove it from your requirements file if none of your other requirements require html5lib. This means Bleach will now work fine with other libraries that depend on html5lib regardless of what version of html5lib they require. (#386) **Bug fixes** * Fixed tags getting added when using clean or linkify. This was a long-standing regression from the Bleach 2.0 rewrite. (#280, #392) * Fixed ```` getting replaced with a string. Now it gets escaped or stripped depending on whether it's in the allowed tags or not. (#279) Version 2.1.4 (August 16th, 2018) --------------------------------- **Security fixes** None **Backwards incompatible changes** * Dropped support for Python 3.3. (#328) **Features** None **Bug fixes** * Handle ambiguous ampersands in correctly. (#359) Version 2.1.3 (March 5th, 2018) ------------------------------- **Security fixes** * Attributes that have URI values weren't properly sanitized if the values contained character entities. Using character entities, it was possible to construct a URI value with a scheme that was not allowed that would slide through unsanitized. This security issue was introduced in Bleach 2.1. Anyone using Bleach 2.1 is highly encouraged to upgrade. 
https://bugzilla.mozilla.org/show_bug.cgi?id=1442745 **Backwards incompatible changes** None **Features** None **Bug fixes** * Fixed some other edge cases for attribute URI value sanitizing and improved testing of this code. Version 2.1.2 (December 7th, 2017) ---------------------------------- **Security fixes** None **Backwards incompatible changes** None **Features** None **Bug fixes** * Support html5lib-python 1.0.1. (#337) * Add deprecation warning for supporting html5lib-python < 1.0. * Switch to semver. Version 2.1.1 (October 2nd, 2017) --------------------------------- **Security fixes** None **Backwards incompatible changes** None **Features** None **Bug fixes** * Fix ``setup.py`` opening files when ``LANG=``. (#324) Version 2.1 (September 28th, 2017) ---------------------------------- **Security fixes** * Convert control characters (backspace particularly) to "?" preventing malicious copy-and-paste situations. (#298) See ``_ for more details. This affects all previous versions of Bleach. Check the comments on that issue for ways to alleviate the issue if you can't upgrade to Bleach 2.1. **Backwards incompatible changes** * Redid versioning. ``bleach.VERSION`` is no longer available. Use the string version at ``bleach.__version__`` and parse it with ``pkg_resources.parse_version``. (#307) * clean, linkify: linkify and clean should only accept text types; thank you, Janusz! (#292) * clean, linkify: accept only unicode or utf-8-encoded str (#176) **Features** **Bug fixes** * ``bleach.clean()`` no longer unescapes entities including ones that are missing a ``;`` at the end which can happen in urls and other places. (#143) * linkify: fix http links inside of mailto links; thank you, sedrubal! (#300) * clarify security policy in docs (#303) * fix dependency specification for html5lib 1.0b8, 1.0b9, and 1.0b10; thank you, Zoltán! (#268) * add Bleach vs. html5lib comparison to README; thank you, Stu Cox! 
(#278) * fix KeyError exceptions on tags without href attr; thank you, Alex Defsen! (#273) * add test website and scripts to test ``bleach.clean()`` output in browser; thank you, Greg Guthe! Version 2.0 (March 8th, 2017) ----------------------------- **Security fixes** * None **Backwards incompatible changes** * Removed support for Python 2.6. #206 * Removed support for Python 3.2. #224 * Bleach no longer supports html5lib < 0.99999999 (8 9s). This version is a rewrite to use the new sanitizing API since the old one was dropped in html5lib 0.99999999 (8 9s). If you're using 0.9999999 (7 9s) upgrade to 0.99999999 (8 9s) or higher. If you're using 1.0b8 (equivalent to 0.9999999 (7 9s)), upgrade to 1.0b9 (equivalent to 0.99999999 (8 9s)) or higher. * ``bleach.clean`` and friends were rewritten ``clean`` was reimplemented as an html5lib filter and happens at a different step in the HTML parsing -> traversing -> serializing process. Because of that, there are some differences in clean's output as compared with previous versions. Amongst other things, this version will add end tags even if the tag in question is to be escaped. * ``bleach.clean`` and friends attribute callables now take three arguments: tag, attribute name and attribute value. Previously they only took attribute name and attribute value. All attribute callables will need to be updated. * ``bleach.linkify`` was rewritten ``linkify`` was reimplemented as an html5lib Filter. As such, it no longer accepts a ``tokenizer`` argument. The callback functions for adjusting link attributes now takes a namespaced attribute. 
  Previously you'd do something like this::

      def check_protocol(attrs, is_new):
          if not attrs.get('href', '').startswith(('http:', 'https:')):
              return None
          return attrs

  Now it's more like this::

      def check_protocol(attrs, is_new):
          if not attrs.get((None, u'href'), u'').startswith(('http:', 'https:')):
              #            ^^^^^^^^^^^^^^^
              return None
          return attrs

  Further, you need to make sure you're always using unicode values. If you don't, then html5lib will raise an assertion error that the value is not unicode.

  All linkify filters will need to be updated.

* ``bleach.linkify`` and friends had a ``skip_pre`` argument--that's been replaced with a more general ``skip_tags`` argument.

  Before, you might do::

      bleach.linkify(some_text, skip_pre=True)

  The equivalent with Bleach 2.0 is::

      bleach.linkify(some_text, skip_tags=['pre'])

  You can skip other tags, too, like ``style`` or ``script`` or other places where you don't want linkification happening.

  All uses of linkify that use ``skip_pre`` will need to be updated.

**Changes**

* Supports Python 3.6.
* Supports html5lib >= 0.99999999 (8 9s).
* There's a ``bleach.sanitizer.Cleaner`` class that you can instantiate with your favorite clean settings for easy reuse.
* There's a ``bleach.linkifier.Linker`` class that you can instantiate with your favorite linkify settings for easy reuse.
* There's a ``bleach.linkifier.LinkifyFilter`` which is an html5lib filter that you can pass as a filter to ``bleach.sanitizer.Cleaner`` allowing you to clean and linkify in one pass.
* ``bleach.clean`` and friends can now take a callable as an attributes arg value.
* Tons of bug fixes.
* Cleaned up tests.
* Documentation fixes.

Version 1.5 (November 4th, 2016)
--------------------------------

**Security fixes**

* None

**Backwards incompatible changes**

* clean: The list of ``ALLOWED_PROTOCOLS`` now defaults to http, https and mailto.
Previously it was a long list of protocols something like ed2k, ftp, http, https, irc, mailto, news, gopher, nntp, telnet, webcal, xmpp, callto, feed, urn, aim, rsync, tag, ssh, sftp, rtsp, afs, data. #149 **Changes** * clean: Added ``protocols`` to arguments list to let you override the list of allowed protocols. Thank you, Andreas Malecki! #149 * linkify: Fix a bug involving periods at the end of an email address. Thank you, Lorenz Schori! #219 * linkify: Fix linkification of non-ascii ports. Thank you Alexandre, Macabies! #207 * linkify: Fix linkify inappropriately removing node tails when dropping nodes. #132 * Fixed a test that failed periodically. #161 * Switched from nose to py.test. #204 * Add test matrix for all supported Python and html5lib versions. #230 * Limit to html5lib ``>=0.999,!=0.9999,!=0.99999,<0.99999999`` because 0.9999 and 0.99999 are busted. * Add support for ``python setup.py test``. #97 Version 1.4.3 (May 23rd, 2016) ------------------------------ **Security fixes** * None **Changes** * Limit to html5lib ``>=0.999,<0.99999999`` because of impending change to sanitizer api. #195 Version 1.4.2 (September 11, 2015) ---------------------------------- **Changes** * linkify: Fix hang in linkify with ``parse_email=True``. #124 * linkify: Fix crash in linkify when removing a link that is a first-child. #136 * Updated TLDs. * linkify: Don't remove exterior brackets when linkifying. #146 Version 1.4.1 (December 15, 2014) --------------------------------- **Changes** * Consistent order of attributes in output. * Python 3.4 support. Version 1.4 (January 12, 2014) ------------------------------ **Changes** * linkify: Update linkify to use etree type Treewalker instead of simpletree. * Updated html5lib to version ``>=0.999``. * Update all code to be compatible with Python 3 and 2 using six. * Switch to Apache License. Version 1.3 ----------- * Used by Python 3-only fork. 
Version 1.2.2 (May 18, 2013) ---------------------------- * Pin html5lib to version 0.95 for now due to major API break. Version 1.2.1 (February 19, 2013) --------------------------------- * ``clean()`` no longer considers ``feed:`` an acceptable protocol due to inconsistencies in browser behavior. Version 1.2 (January 28, 2013) ------------------------------ * ``linkify()`` has changed considerably. Many keyword arguments have been replaced with a single callbacks list. Please see the documentation for more information. * Bleach will no longer consider unacceptable protocols when linkifying. * ``linkify()`` now takes a tokenizer argument that allows it to skip sanitization. * ``delinkify()`` is gone. * Removed exception handling from ``_render``. ``clean()`` and ``linkify()`` may now throw. * ``linkify()`` correctly ignores case for protocols and domain names. * ``linkify()`` correctly handles markup within an tag. Version 1.1.5 ------------- Version 1.1.4 ------------- Version 1.1.3 (July 10, 2012) ----------------------------- * Fix parsing bare URLs when parse_email=True. Version 1.1.2 (June 1, 2012) ---------------------------- * Fix hang in style attribute sanitizer. (#61) * Allow ``/`` in style attribute values. Version 1.1.1 (February 17, 2012) --------------------------------- * Fix tokenizer for html5lib 0.9.5. Version 1.1.0 (October 24, 2011) -------------------------------- * ``linkify()`` now understands port numbers. (#38) * Documented character encoding behavior. (#41) * Add an optional target argument to ``linkify()``. * Add ``delinkify()`` method. (#45) * Support subdomain whitelist for ``delinkify()``. (#47, #48) Version 1.0.4 (September 2, 2011) --------------------------------- * Switch to SemVer git tags. * Make ``linkify()`` smarter about trailing punctuation. (#30) * Pass ``exc_info`` to logger during rendering issues. * Add wildcard key for attributes. (#19) * Make ``linkify()`` use the ``HTMLSanitizer`` tokenizer. 
  (#36)
* Fix URLs wrapped in parentheses. (#23)
* Make ``linkify()`` UTF-8 safe. (#33)

Version 1.0.3 (June 14, 2011)
-----------------------------

* ``linkify()`` works with 3rd level domains. (#24)
* ``clean()`` supports vendor prefixes in style values. (#31, #32)
* Fix ``linkify()`` email escaping.

Version 1.0.2 (June 6, 2011)
----------------------------

* ``linkify()`` supports email addresses.
* ``clean()`` supports callables in attributes filter.

Version 1.0.1 (April 12, 2011)
------------------------------

* ``linkify()`` doesn't drop trailing slashes. (#21)
* ``linkify()`` won't linkify 'libgl.so.1'. (#22)
bleach-3.1.1/CODE_OF_CONDUCT.rst000066400000000000000000000005621362327062400157510ustar00rootroot00000000000000
Code of conduct
===============

This project and repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details please see the `Mozilla Community Participation Guidelines `_ and `Developer Etiquette Guidelines `_.
bleach-3.1.1/CONTRIBUTING.rst000066400000000000000000000012671362327062400154060ustar00rootroot00000000000000
Reporting Bugs
==============

For regular bugs, please report them `in our issue tracker `_.

If you believe that you've found a security vulnerability, please `file a secure bug report in our bug tracker `_ or send an email to *security AT mozilla DOT org*.

For more information on security-related bug disclosure and the PGP key to use for sending encrypted mail or to verify responses received from that address, please read our wiki page at ``_.
bleach-3.1.1/CONTRIBUTORS000066400000000000000000000023161362327062400146210ustar00rootroot00000000000000
Bleach was originally written and maintained by James Socol and various contributors within and without the Mozilla Corporation and Foundation.

It is currently maintained by Will Kahn-Greene and Greg Guthe.
Maintainers: - Will Kahn-Greene - Greg Guthe Maintainer emeritus: - Jannis Leidel - James Socol Contributors: - Adam Lofts - Adrian "ThiefMaster" - Alek - Alex Defsen - Alex Ehlke - Alexandre Macabies - Alexandr N. Zamaraev - Alireza Savand - Andreas Malecki - Andy Freeland - Antoine Leclair - Anton Backer - Anton Kovalyov - Chad Birch - Chris Beaven - Dan Gayle - dave-shawley - Erik Rose - Gaurav Dadhania - Geoffrey Sneddon - Greg Guthe - hugovk - Istvan Albert - Jaime Irurzun - James Socol - Jannis Leidel - Janusz Kamieński - Jeff Balogh - Jonathan Vanasco - Lee, Cheon-il - Les Orchard - Lorenz Schori - Luis Nell - Marc Abramowitz - Marc DM - Mark Lee - Mark Paschal - mdxs - Nikita Sobolev - nikolas - Oh Jinkyun - Paul Craciunoiu - Ricky Rosario - Ryan Niemeyer - Sébastien Fievet - sedrubal - Stephane Blondon - Stu Cox - Tim Dumol - Timothy Fitz - Vadim Kotov - Vitaly Volkov - Will Kahn-Greene - Zoltán - zyegfryed bleach-3.1.1/LICENSE000066400000000000000000000010711362327062400137430ustar00rootroot00000000000000Copyright (c) 2014-2017, Mozilla Foundation Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
bleach-3.1.1/MANIFEST.in000066400000000000000000000006631362327062400145020ustar00rootroot00000000000000include CHANGES include CONTRIBUTORS include CONTRIBUTING.rst include CODE_OF_CONDUCT.rst include requirements-dev.txt include tox.ini include LICENSE include README.rst include docs/conf.py include docs/Makefile include scripts/* recursive-include bleach *.py *.json *.rst *.sh *.txt INSTALLER METADATA RECORD WHEEL recursive-include docs *.rst recursive-include tests *.py *.test recursive-include tests_website *.html *.py *.rst bleach-3.1.1/README.rst000066400000000000000000000072011362327062400144260ustar00rootroot00000000000000====== Bleach ====== .. image:: https://travis-ci.org/mozilla/bleach.svg?branch=master :target: https://travis-ci.org/mozilla/bleach .. image:: https://badge.fury.io/py/bleach.svg :target: http://badge.fury.io/py/bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, applying filters that Django's ``urlize`` filter cannot, and optionally setting ``rel`` attributes, even on links already in the text. Bleach is intended for sanitizing text from *untrusted* sources. If you find yourself jumping through hoops to allow your site administrators to do lots of things, you're probably outside the use cases. Either trust those users, or don't. Because it relies on html5lib_, Bleach is as good as modern browsers at dealing with weird, quirky HTML fragments. And *any* of Bleach's methods will fix unbalanced or mis-nested tags. The version on GitHub_ is the most up-to-date and contains the latest bug fixes. You can find full documentation on `ReadTheDocs`_. 
:Code:           https://github.com/mozilla/bleach
:Documentation:  https://bleach.readthedocs.io/
:Issue tracker:  https://github.com/mozilla/bleach/issues
:IRC:            ``#bleach`` on irc.mozilla.org
:License:        Apache License v2; see LICENSE file


Reporting Bugs
==============

For regular bugs, please report them `in our issue tracker `_.

If you believe that you've found a security vulnerability, please `file a secure bug report in our bug tracker `_ or send an email to *security AT mozilla DOT org*.

For more information on security-related bug disclosure and the PGP key to use for sending encrypted mail or to verify responses received from that address, please read our wiki page at ``_.


Security
========

Bleach is a security-focused library.

We have a responsible security vulnerability reporting process. Please use that if you're reporting a security issue.

Security issues are fixed in private. After we land such a fix, we'll do a release.

For every release, we mark the security issues we've fixed in the **Security fixes** section of ``CHANGES``. We include any relevant CVE links.


Installing Bleach
=================

Bleach is available on PyPI_, so you can install it with ``pip``::

    $ pip install bleach


Upgrading Bleach
================

.. warning::

   Before doing any upgrades, read through `Bleach Changes `_ for backwards incompatible changes, newer versions, etc.


Basic use
=========

The simplest way to use Bleach is:

.. code-block:: python

    >>> import bleach

    >>> bleach.clean('an <script>evil()</script> example')
    u'an &lt;script&gt;evil()&lt;/script&gt; example'

    >>> bleach.linkify('an http://example.com url')
    u'an <a href="http://example.com" rel="nofollow">http://example.com</a> url'


Code of conduct
===============

This project and repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details please see the `Mozilla Community Participation Guidelines `_ and `Developer Etiquette Guidelines `_.


.. _html5lib: https://github.com/html5lib/html5lib-python
.. _GitHub: https://github.com/mozilla/bleach
..
_ReadTheDocs: https://bleach.readthedocs.io/ .. _PyPI: http://pypi.python.org/pypi/bleach bleach-3.1.1/bleach/000077500000000000000000000000001362327062400141555ustar00rootroot00000000000000bleach-3.1.1/bleach/__init__.py000066400000000000000000000072771362327062400163030ustar00rootroot00000000000000# -*- coding: utf-8 -*- from __future__ import unicode_literals from pkg_resources import parse_version from bleach.linkifier import ( DEFAULT_CALLBACKS, Linker, ) from bleach.sanitizer import ( ALLOWED_ATTRIBUTES, ALLOWED_PROTOCOLS, ALLOWED_STYLES, ALLOWED_TAGS, Cleaner, ) # yyyymmdd __releasedate__ = '20200213' # x.y.z or x.y.z.dev0 -- semver __version__ = '3.1.1' VERSION = parse_version(__version__) __all__ = ['clean', 'linkify'] def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, styles=ALLOWED_STYLES, protocols=ALLOWED_PROTOCOLS, strip=False, strip_comments=True): """Clean an HTML fragment of malicious content and return it This function is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page. This function is not designed to use to transform content to be used in non-web-page contexts. Example:: import bleach better_text = bleach.clean(yucky_text) .. Note:: If you're cleaning a lot of text and passing the same argument values or you want more configurability, consider using a :py:class:`bleach.sanitizer.Cleaner` instance. 
:arg str text: the text to clean :arg list tags: allowed list of tags; defaults to ``bleach.sanitizer.ALLOWED_TAGS`` :arg dict attributes: allowed attributes; can be a callable, list or dict; defaults to ``bleach.sanitizer.ALLOWED_ATTRIBUTES`` :arg list styles: allowed list of css styles; defaults to ``bleach.sanitizer.ALLOWED_STYLES`` :arg list protocols: allowed list of protocols for links; defaults to ``bleach.sanitizer.ALLOWED_PROTOCOLS`` :arg bool strip: whether or not to strip disallowed elements :arg bool strip_comments: whether or not to strip HTML comments :returns: cleaned text as unicode """ cleaner = Cleaner( tags=tags, attributes=attributes, styles=styles, protocols=protocols, strip=strip, strip_comments=strip_comments, ) return cleaner.clean(text) def linkify(text, callbacks=DEFAULT_CALLBACKS, skip_tags=None, parse_email=False): """Convert URL-like strings in an HTML fragment to links This function converts strings that look like URLs, domain names and email addresses in text that may be an HTML fragment to links, while preserving: 1. links already in the string 2. urls found in attributes 3. email addresses linkify does a best-effort approach and tries to recover from bad situations due to crazy text. .. Note:: If you're linking a lot of text and passing the same argument values or you want more configurability, consider using a :py:class:`bleach.linkifier.Linker` instance. .. Note:: If you have text that you want to clean and then linkify, consider using the :py:class:`bleach.linkifier.LinkifyFilter` as a filter in the clean pass. That way you're not parsing the HTML twice. 
    :arg str text: the text to linkify

    :arg list callbacks: list of callbacks to run when adjusting tag attributes; defaults to ``bleach.linkifier.DEFAULT_CALLBACKS``

    :arg list skip_tags: list of tags that you don't want to linkify the contents of; for example, you could set this to ``['pre']`` to skip linkifying contents of ``pre`` tags

    :arg bool parse_email: whether or not to linkify email addresses

    :returns: linkified text as unicode

    """
    linker = Linker(
        callbacks=callbacks,
        skip_tags=skip_tags,
        parse_email=parse_email
    )
    return linker.linkify(text)
bleach-3.1.1/bleach/_vendor/000077500000000000000000000000001362327062400156115ustar00rootroot00000000000000bleach-3.1.1/bleach/_vendor/README.rst000066400000000000000000000023511362327062400173010ustar00rootroot00000000000000
=======================
Vendored library policy
=======================

To simplify Bleach development, we're now vendoring certain libraries that we use.

Vendored libraries must follow these rules:

1. Vendored libraries must be pure Python--no compiling.
2. Source code for the library is included in this directory.
3. License must be included in this repo and in the Bleach distribution.
4. Requirements of the library become requirements of Bleach.
5. No modifications to the library may be made.


Adding/Updating a vendored library
==================================

To vendor a library or update a version:

1. Update ``vendor.txt`` with the library, version, and hash. You can use `hashin `_.
2. Remove all old files and directories of the old version.
3. Run ``pip_install_vendor.sh`` and check in everything it produced, including the ``.dist-info`` directory and contents.


Reviewing a change involving a vendored library
===============================================

To verify a vendored library addition or update:

1. Pull down the branch.
2. Delete all the old files and directories of the old version.
3. Run ``pip_install_vendor.sh``.
4. Run ``git diff`` and verify there are no changes.
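
Step 1 of the update procedure records a hash for the vendored library. As a rough, non-authoritative sketch of how such a pin could be checked by hand, the snippet below assumes ``vendor.txt`` uses pip's ``--hash=sha256:<digest>`` requirement syntax. The helper names ``pinned_hashes``, ``sha256_of`` and ``matches_pin`` are hypothetical and are not part of Bleach:

```python
import hashlib
import re

# Assumption: pins are written in pip's hash-checking syntax,
# e.g. "html5lib==1.0.1 --hash=sha256:<64 hex digits>".
HASH_RE = re.compile(r"--hash=sha256:([0-9a-f]{64})")


def pinned_hashes(requirement_text):
    """Return the set of sha256 digests pinned in a requirements entry."""
    return set(HASH_RE.findall(requirement_text))


def sha256_of(path, chunk_size=65536):
    """Return the hex sha256 digest of the file at ``path``."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_pin(path, requirement_text):
    """Return True if the file's digest is one of the pinned digests."""
    return sha256_of(path) in pinned_hashes(requirement_text)
```

Calling ``matches_pin`` with the path of a downloaded archive and the contents of ``vendor.txt`` would then report whether that archive is the one that was pinned. This complements the ``git diff`` check above and does not replace it.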
bleach-3.1.1/bleach/_vendor/__init__.py000066400000000000000000000000001362327062400177100ustar00rootroot00000000000000bleach-3.1.1/bleach/_vendor/html5lib-1.0.1.dist-info/000077500000000000000000000000001362327062400216575ustar00rootroot00000000000000bleach-3.1.1/bleach/_vendor/html5lib-1.0.1.dist-info/DESCRIPTION.rst000066400000000000000000000327031362327062400242010ustar00rootroot00000000000000
html5lib
========

.. image:: https://travis-ci.org/html5lib/html5lib-python.png?branch=master
   :target: https://travis-ci.org/html5lib/html5lib-python

html5lib is a pure-python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as is implemented by all major web browsers.


Usage
-----

Simple usage follows this pattern:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      document = html5lib.parse(f)

or:

.. code-block:: python

  import html5lib
  document = html5lib.parse("<p>Hello World!")

By default, the ``document`` will be an ``xml.etree`` element instance. Whenever possible, html5lib chooses the accelerated ``ElementTree`` implementation (i.e. ``xml.etree.cElementTree`` on Python 2.x).

Two other tree types are supported: ``xml.dom.minidom`` and ``lxml.etree``. To use an alternative format, specify the name of a treebuilder:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      lxml_etree_document = html5lib.parse(f, treebuilder="lxml")

When using html5lib with ``urllib2`` (Python 2), the charset from HTTP should be passed into html5lib as follows:

.. code-block:: python

  from contextlib import closing
  from urllib2 import urlopen
  import html5lib

  with closing(urlopen("http://example.com/")) as f:
      document = html5lib.parse(f, transport_encoding=f.info().getparam("charset"))

When using html5lib with ``urllib.request`` (Python 3), the charset from HTTP should be passed into html5lib as follows:

.. code-block:: python

  from urllib.request import urlopen
  import html5lib

  with urlopen("http://example.com/") as f:
      document = html5lib.parse(f, transport_encoding=f.info().get_content_charset())

To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:

.. code-block:: python

  import html5lib
  with open("mydocument.html", "rb") as f:
      parser = html5lib.HTMLParser(strict=True)
      document = parser.parse(f)

When you're instantiating parser objects explicitly, pass a treebuilder class as the ``tree`` keyword argument to use an alternative document format:

.. code-block:: python

  import html5lib
  parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
  minidom_document = parser.parse("<p>Hello World!")

More documentation is available at https://html5lib.readthedocs.io/.


Installation
------------

html5lib works on CPython 2.7+, CPython 3.3+ and PyPy. To install it, use:

.. code-block:: bash

    $ pip install html5lib


Optional Dependencies
---------------------

The following third-party libraries may be used for additional functionality:

- ``datrie`` can be used under CPython to improve parsing performance (though in almost all cases the improvement is marginal);

- ``lxml`` is supported as a tree format (for both building and walking) under CPython (but *not* PyPy where it is known to cause segfaults);

- ``genshi`` has a treewalker (but not builder); and

- ``chardet`` can be used as a fallback when character encoding cannot be determined.


Bugs
----

Please report any bugs on the `issue tracker `_.


Tests
-----

Unit tests require the ``pytest`` and ``mock`` libraries and can be run using the ``py.test`` command in the root directory.

Test data are contained in a separate `html5lib-tests `_ repository and included as a submodule, thus for git checkouts they must be initialized::

  $ git submodule init
  $ git submodule update

If you have all compatible Python implementations available on your system, you can run tests on all of them using the ``tox`` utility, which can be found on PyPI.


Questions?
----------

There's a mailing list available for support on Google Groups, `html5lib-discuss `_, though you may get a quicker response asking on IRC in `#whatwg on irc.freenode.net `_.


Change Log
----------

1.0.1
~~~~~

Released on December 7, 2017

Breaking changes:

* Drop support for Python 2.6. (#330) (Thank you, Hugo, Will Kahn-Greene!)
* Remove ``utils/spider.py`` (#353) (Thank you, Jon Dufresne!)

Features:

* Improve documentation. (#300, #307) (Thank you, Jon Dufresne, Tom Most, Will Kahn-Greene!)
* Add iframe seamless boolean attribute. (Thank you, Ritwik Gupta!)
* Add itemscope as a boolean attribute. (#194) (Thank you, Jonathan Vanasco!)
* Support Python 3.6.
  (#333) (Thank you, Jon Dufresne!)
* Add CI support for Windows using AppVeyor. (Thank you, John Vandenberg!)
* Improve testing and CI and add code coverage. (#323, #334) (Thank you, Jon Dufresne, John Vandenberg, Geoffrey Sneddon, Will Kahn-Greene!)
* Semver-compliant version number.

Bug fixes:

* Add support for setuptools < 18.5 to support environment markers. (Thank you, John Vandenberg!)
* Add explicit dependency for six >= 1.9. (Thank you, Eric Amorde!)
* Fix regexes to work with Python 3.7 regex adjustments. (#318, #379) (Thank you, Benedikt Morbach, Ville Skyttä, Mark Vasilkov!)
* Fix alphabeticalattributes filter namespace bug. (#324) (Thank you, Will Kahn-Greene!)
* Include license file in generated wheel package. (#350) (Thank you, Jon Dufresne!)
* Fix annotation-xml typo. (#339) (Thank you, Will Kahn-Greene!)
* Allow uppercase hex characters in CSS colour check. (#377) (Thank you, Komal Dembla, Hugo!)


1.0
~~~

Released and unreleased on December 7, 2017. Badly packaged release.


0.999999999/1.0b10
~~~~~~~~~~~~~~~~~~

Released on July 15, 2016

* Fix attribute order going to the tree builder to be document order instead of reverse document order(!).


0.99999999/1.0b9
~~~~~~~~~~~~~~~~

Released on July 14, 2016

* **Added ordereddict as a mandatory dependency on Python 2.6.**

* Added ``lxml``, ``genshi``, ``datrie``, ``charade``, and ``all`` extras that will do the right thing based on the specific interpreter implementation.

* Now requires the ``mock`` package for the testsuite.

* Cease supporting DATrie under PyPy.

* **Remove PullDOM support, as this hasn't ever been properly tested, doesn't entirely work, and as far as I can tell is completely unused by anyone.**

* Move testsuite to ``py.test``.
* **Fix #124: move to webencodings for decoding the input byte stream;
  this makes html5lib compliant with the Encoding Standard, and
  introduces a required dependency on webencodings.**

* **Cease supporting Python 3.2 (in both CPython and PyPy forms).**

* **Fix comments containing double-dash with lxml 3.5 and above.**

* **Use scripting disabled by default (as we don't implement
  scripting).**

* **Fix #11, avoiding the XSS bug potentially caused by serializer
  allowing attribute values to be escaped out of in old browser versions,
  changing the quote_attr_values option on serializer to take one of
  three values, "always" (the old True value), "legacy" (the new option,
  and the new default), and "spec" (the old False value, and the old
  default).**

* **Fix #72 by rewriting the sanitizer to apply only to treewalkers
  (instead of the tokenizer); as such, this will require amending all
  callers of it to use it via the treewalker API.**

* **Drop support of charade, now that chardet is supported once more.**

* **Replace the charset keyword argument on parse and related methods
  with a set of keyword arguments: override_encoding, transport_encoding,
  same_origin_parent_encoding, likely_encoding, and default_encoding.**

* **Move filters._base, treebuilder._base, and treewalkers._base to .base
  to clarify their status as public.**

* **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
  sanitizer.htmlsanitizer module and move that to sanitizer. This means
  anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
  code changes.**

* **Rename treewalkers.lxmletree to .etree_lxml and
  treewalkers.genshistream to .genshi to have a consistent API.**

* Move a whole load of stuff (inputstream, ihatexml, trie, tokenizer,
  utils) to be underscore prefixed to clarify their status as private.


0.9999999/1.0b8
~~~~~~~~~~~~~~~

Released on September 10, 2015

* Fix #195: fix the sanitizer to drop broken URLs (it threw an
  exception between 0.9999 and 0.999999).
0.999999/1.0b7
~~~~~~~~~~~~~~

Released on July 7, 2015

* Fix #189: fix the sanitizer to allow relative URLs again (as it did
  prior to 0.9999/1.0b5).


0.99999/1.0b6
~~~~~~~~~~~~~

Released on April 30, 2015

* Fix #188: fix the sanitizer to not throw an exception when sanitizing
  bogus data URLs.


0.9999/1.0b5
~~~~~~~~~~~~

Released on April 29, 2015

* Fix #153: Sanitizer fails to treat some attributes as URLs. Despite how
  this sounds, this has no known security implications. No known version
  of IE (5.5 to current), Firefox (3 to current), Safari (6 to current),
  Chrome (1 to current), or Opera (12 to current) will run any script
  provided in these attributes.

* Pass error message to the ParseError exception in strict parsing mode.

* Allow data URIs in the sanitizer, with a whitelist of content-types.

* Add support for Python implementations that don't support lone
  surrogates (read: Jython). Fixes #2.

* Remove localization of error messages. This functionality was totally
  unused (and untested that everything was localizable), so we may as
  well follow numerous browsers in not supporting translating technical
  strings.

* Expose treewalkers.pprint as a public API.

* Add a documentEncoding property to HTML5Parser, fix #121.


0.999
~~~~~

Released on December 23, 2013

* Fix #127: add work-around for CPython issue #20007: .read(0) on
  http.client.HTTPResponse drops the rest of the content.

* Fix #115: lxml treewalker can now deal with fragments containing, at
  their root level, text nodes with non-ASCII characters on Python 2.


0.99
~~~~

Released on September 10, 2013

* No library changes from 1.0b3; released as 0.99 as pip has changed
  behaviour from 1.4 to avoid installing pre-release versions per
  PEP 440.


1.0b3
~~~~~

Released on July 24, 2013

* Removed ``RecursiveTreeWalker`` from ``treewalkers._base``. Any
  implementation using it should be moved to
  ``NonRecursiveTreeWalker``, as everything bundled with html5lib has
  for years.
* Fix #67 so that ``BufferedStream`` correctly returns a bytes
  object, thereby fixing any case where html5lib is passed a
  non-seekable RawIOBase-like object.


1.0b2
~~~~~

Released on June 27, 2013

* Removed reordering of attributes within the serializer. There is now
  an ``alphabetical_attributes`` option which preserves the previous
  behaviour through a new filter. This allows attribute order to be
  preserved through html5lib if the tree builder preserves order.

* Removed ``dom2sax`` from DOM treebuilders. It has been replaced by
  ``treeadapters.sax.to_sax`` which is generic and supports any
  treewalker; it also resolves all known bugs with ``dom2sax``.

* Fix treewalker assertions on hitting bytes strings on
  Python 2. Previous to 1.0b1, treewalkers coped with mixed
  bytes/unicode data on Python 2; this reintroduces this prior
  behaviour on Python 2. Behaviour is unchanged on Python 3.


1.0b1
~~~~~

Released on May 17, 2013

* Implementation updated to implement the `HTML specification `_ as
  of 5th May 2013 (`SVN `_ revision r7867).

* Python 3.2+ supported in a single codebase using the ``six`` library.

* Removed support for Python 2.5 and older.

* Removed the deprecated Beautiful Soup 3 treebuilder.
  ``beautifulsoup4`` can use ``html5lib`` as a parser instead. Note that
  since it doesn't support namespaces, foreign content like SVG and
  MathML is parsed incorrectly.

* Removed ``simpletree`` from the package. The default tree builder is
  now ``etree`` (using the ``xml.etree.cElementTree`` implementation if
  available, and ``xml.etree.ElementTree`` otherwise).

* Removed the ``XHTMLSerializer`` as it never actually guaranteed its
  output was well-formed XML, and hence provided little of use.

* Removed default DOM treebuilder, so ``html5lib.treebuilders.dom`` is no
  longer supported. ``html5lib.treebuilders.getTreeBuilder("dom")`` will
  return the default DOM treebuilder, which uses ``xml.dom.minidom``.
* Optional heuristic character encoding detection now based on
  ``charade`` for Python 2.6 - 3.3 compatibility.

* Optional ``Genshi`` treewalker support fixed.

* Many bugfixes, including:

  * #33: null in attribute value breaks XML AttValue;

  * #4: nested, indirect descendant,

bleach-3.1.1/tests_website/open_test_page.py000077500000000000000000000012221362327062400211710ustar00rootroot00000000000000
#!/usr/bin/env python

import webbrowser

TEST_BROWSERS = set([
    # 'mozilla',
    'firefox',
    # 'netscape',
    # 'galeon',
    # 'epiphany',
    # 'skipstone',
    # 'kfmclient',
    # 'konqueror',
    # 'kfm',
    # 'mosaic',
    # 'opera',
    # 'grail',
    # 'links',
    # 'elinks',
    # 'lynx',
    # 'w3m',
    'windows-default',
    # 'macosx',
    'safari',
    # 'google-chrome',
    'chrome',
    # 'chromium',
    # 'chromium-browser',
])
REGISTERED_BROWSERS = set(webbrowser._browsers.keys())


if __name__ == '__main__':
    for b in TEST_BROWSERS & REGISTERED_BROWSERS:
        webbrowser.get(b).open_new_tab('http://localhost:8080')
bleach-3.1.1/tests_website/server.py000077500000000000000000000025001362327062400175030ustar00rootroot00000000000000
#!/usr/bin/env python

"""
Simple Test/Demo Server for running bleach.clean output on various
desktops.

Usage:

    python server.py

"""

import six

import bleach


PORT = 8080


class BleachCleanHandler(six.moves.SimpleHTTPServer.SimpleHTTPRequestHandler):
    def do_POST(self):
        if six.PY2:
            content_len = int(self.headers.getheader('content-length', 0))
        else:
            content_len = int(self.headers.get('content-length', 0))
        body = self.rfile.read(content_len)
        print("read %s bytes: %s" % (content_len, body))

        if six.PY3:
            body = body.decode('utf-8')
        print('input: %r' % body)
        cleaned = bleach.clean(body)

        # Encode before sending headers so that on Python 3 Content-Length
        # counts the bytes actually written, not the number of characters.
        if six.PY3:
            cleaned = bytes(cleaned, encoding='utf-8')
        print("cleaned: %r" % cleaned)

        self.send_response(200)
        self.send_header('Content-Length', len(cleaned))
        self.send_header('Content-Type', 'text/plain;charset=UTF-8')
        self.end_headers()

        self.wfile.write(cleaned)


if __name__ == '__main__':
    # Prevent 'cannot bind to address' errors on restart
    six.moves.socketserver.TCPServer.allow_reuse_address = True

    httpd = six.moves.socketserver.TCPServer(('127.0.0.1', PORT), BleachCleanHandler)
    print("listening on localhost port %d" % PORT)
    httpd.serve_forever()
bleach-3.1.1/tox.ini000066400000000000000000000031441362327062400142540ustar00rootroot00000000000000
# Tox (http://tox.testrun.org/) is a tool for running tests
# in multiple virtualenvs. This configuration file will run the
# test suite on all supported python versions. To use it, "pip install tox"
# and then run "tox" from this directory.

# Note: If you update this, make sure to update .travis.yml, too.

[tox]
envlist =
    py{27,34,35,36,37}
    pypy
    py{27,34,35,36,37}-build-no-lang
    docs
    lint
    vendorverify

[testenv]
basepython =
    py27: python2.7
    py34: python3.4
    py35: python3.5
    py36: python3.6
    py37: python3.7
deps =
    -rrequirements-dev.txt
commands =
    pytest {posargs:-v}
    python setup.py build

[testenv:py27-build-no-lang]
basepython = python2.7
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py34-build-no-lang]
basepython = python3.4
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py35-build-no-lang]
basepython = python3.5
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py36-build-no-lang]
basepython = python3.6
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py37-build-no-lang]
basepython = python3.7
setenv =
    LANG=
commands =
    python setup.py build

[testenv:lint]
basepython = python3.6
changedir = scripts
deps =
    -rrequirements-dev.txt
commands =
    ./run_tests.sh lint

[testenv:vendorverify]
basepython = python3.6
changedir = scripts
deps =
    -rrequirements-dev.txt
commands =
    ./run_tests.sh vendorverify

[testenv:docs]
basepython = python3.6
changedir = docs
deps =
    -rrequirements-dev.txt
commands =
    sphinx-build -b html -d {envtmpdir}/doctrees . {envtmpdir}/html