pax_global_header00006660000000000000000000000064132122625430014512gustar00rootroot0000000000000052 comment=35d051ca4b54a3a3b16c243cac3f3bb6678e69bb bleach-2.1.2/000077500000000000000000000000001321226254300127325ustar00rootroot00000000000000bleach-2.1.2/.gitignore000066400000000000000000000001521321226254300147200ustar00rootroot00000000000000*.pyo *.pyc pip-log.txt .coverage dist *.egg-info .noseids build .tox docs/_build/ .cache/ .eggs/ .*env*/ bleach-2.1.2/.travis.yml000066400000000000000000000014361321226254300150470ustar00rootroot00000000000000sudo: false language: python cache: directories: - "~/.cache/pip" python: - "2.7" - "3.3" - "3.4" - "3.5" - "3.6" - "pypy" env: - HTML5LIB=0.99999999 # 8 - HTML5LIB=0.999999999 # 9 install: # html5lib 0.99999999 (8 9s) requires at least setuptools 18.5 - pip install -U pip setuptools>=18.5 - pip install -r requirements.txt # stomp on html5lib install with the specified one - pip install html5lib==$HTML5LIB script: - py.test - flake8 bleach/ deploy: provider: pypi user: jezdez distributions: sdist bdist_wheel password: secure: TTLpnNBAmRBPe4qITwtM6MRXw3CvGpflnkG6V97oKYL1RJhDXmxIxxImkGyVoT2IR4Oy/jqEikWUCCC3aDoqDnIkkDVriTPmo5PGnS2WgvEmYdcaTIp+RXdKwKhpCVX8ITEuye0iCXYu28vDaySGjnxjlYAP4S0PGPUzh/tn4DY= on: tags: true repo: mozilla/bleach python: "2.7" bleach-2.1.2/CHANGES000066400000000000000000000237161321226254300137360ustar00rootroot00000000000000Bleach Changes ============== Version 2.1.2 (December 7th, 2017) ---------------------------------- **Security fixes** None **Backwards incompatible changes** None **Features** None **Bug fixes** * Support html5lib-python 1.0.1. (#337) * Add deprecation warning for supporting html5lib-python < 1.0. * Switch to semver. Version 2.1.1 (October 2nd, 2017) --------------------------------- **Security fixes** None **Backwards incompatible changes** None **Features** None **Bug fixes** * Fix ``setup.py`` opening files when ``LANG=``. (#324) Version 2.1 (September 28th, 2017) ---------------------------------- **Security fixes** * Convert control characters (backspace particularly) to "?" preventing malicious copy-and-paste situations. (#298) See ``_ for more details. This affects all previous versions of Bleach. Check the comments on that issue for ways to alleviate the issue if you can't upgrade to Bleach 2.1. **Backwards incompatible changes** * Redid versioning. ``bleach.VERSION`` is no longer available. Use the string version at ``bleach.__version__`` and parse it with ``pkg_resources.parse_version``. (#307) * clean, linkify: linkify and clean should only accept text types; thank you, Janusz! (#292) * clean, linkify: accept only unicode or utf-8-encoded str (#176) **Features** **Bug fixes** * ``bleach.clean()`` no longer unescapes entities including ones that are missing a ``;`` at the end which can happen in urls and other places. (#143) * linkify: fix http links inside of mailto links; thank you, sedrubal! (#300) * clarify security policy in docs (#303) * fix dependency specification for html5lib 1.0b8, 1.0b9, and 1.0b10; thank you, Zoltán! (#268) * add Bleach vs. html5lib comparison to README; thank you, Stu Cox! (#278) * fix KeyError exceptions on tags without href attr; thank you, Alex Defsen! (#273) * add test website and scripts to test ``bleach.clean()`` output in browser; thank you, Greg Guthe! Version 2.0 (March 8th, 2017) ----------------------------- **Security fixes** * None **Backwards incompatible changes** * Removed support for Python 2.6. #206 * Removed support for Python 3.2. 
#224 * Bleach no longer supports html5lib < 0.99999999 (8 9s). This version is a rewrite to use the new sanitizing API since the old one was dropped in html5lib 0.99999999 (8 9s). If you're using 0.9999999 (7 9s) upgrade to 0.99999999 (8 9s) or higher. If you're using 1.0b8 (equivalent to 0.9999999 (7 9s)), upgrade to 1.0b9 (equivalent to 0.99999999 (8 9s)) or higher. * ``bleach.clean`` and friends were rewritten ``clean`` was reimplemented as an html5lib filter and happens at a different step in the HTML parsing -> traversing -> serializing process. Because of that, there are some differences in clean's output as compared with previous versions. Amongst other things, this version will add end tags even if the tag in question is to be escaped. * ``bleach.clean`` and friends attribute callables now take three arguments: tag, attribute name and attribute value. Previously they only took attribute name and attribute value. All attribute callables will need to be updated. * ``bleach.linkify`` was rewritten ``linkify`` was reimplemented as an html5lib Filter. As such, it no longer accepts a ``tokenizer`` argument. The callback functions for adjusting link attributes now takes a namespaced attribute. Previously you'd do something like this:: def check_protocol(attrs, is_new): if not attrs.get('href', '').startswith('http:', 'https:')): return None return attrs Now it's more like this:: def check_protocol(attrs, is_new): if not attrs.get((None, u'href'), u'').startswith(('http:', 'https:')): # ^^^^^^^^^^^^^^^ return None return attrs Further, you need to make sure you're always using unicode values. If you don't then html5lib will raise an assertion error that the value is not unicode. All linkify filters will need to be updated. * ``bleach.linkify`` and friends had a ``skip_pre`` argument--that's been replaced with a more general ``skip_tags`` argument. Before, you might do:: bleach.linkify(some_text, skip_pre=True) The equivalent with Bleach 2.0 is:: bleach.linkify(some_text, skip_tags=['pre']) You can skip other tags, too, like ``style`` or ``script`` or other places where you don't want linkification happening. All uses of linkify that use ``skip_pre`` will need to be updated. **Changes** * Supports Python 3.6. * Supports html5lib >= 0.99999999 (8 9s). * There's a ``bleach.sanitizer.Cleaner`` class that you can instantiate with your favorite clean settings for easy reuse. * There's a ``bleach.linkifier.Linker`` class that you can instantiate with your favorite linkify settings for easy reuse. * There's a ``bleach.linkifier.LinkifyFilter`` which is an htm5lib filter that you can pass as a filter to ``bleach.sanitizer.Cleaner`` allowing you to clean and linkify in one pass. * ``bleach.clean`` and friends can now take a callable as an attributes arg value. * Tons of bug fixes. * Cleaned up tests. * Documentation fixes. Version 1.5 (November 4th, 2016) -------------------------------- **Security fixes** * None **Backwards incompatible changes** * clean: The list of ``ALLOWED_PROTOCOLS`` now defaults to http, https and mailto. Previously it was a long list of protocols something like ed2k, ftp, http, https, irc, mailto, news, gopher, nntp, telnet, webcal, xmpp, callto, feed, urn, aim, rsync, tag, ssh, sftp, rtsp, afs, data. #149 **Changes** * clean: Added ``protocols`` to arguments list to let you override the list of allowed protocols. Thank you, Andreas Malecki! #149 * linkify: Fix a bug involving periods at the end of an email address. Thank you, Lorenz Schori! 
#219 * linkify: Fix linkification of non-ascii ports. Thank you Alexandre, Macabies! #207 * linkify: Fix linkify inappropriately removing node tails when dropping nodes. #132 * Fixed a test that failed periodically. #161 * Switched from nose to py.test. #204 * Add test matrix for all supported Python and html5lib versions. #230 * Limit to html5lib ``>=0.999,!=0.9999,!=0.99999,<0.99999999`` because 0.9999 and 0.99999 are busted. * Add support for ``python setup.py test``. #97 Version 1.4.3 (May 23rd, 2016) ------------------------------ **Security fixes** * None **Changes** * Limit to html5lib ``>=0.999,<0.99999999`` because of impending change to sanitizer api. #195 Version 1.4.2 (September 11, 2015) ---------------------------------- **Changes** * linkify: Fix hang in linkify with ``parse_email=True``. #124 * linkify: Fix crash in linkify when removing a link that is a first-child. #136 * Updated TLDs. * linkify: Don't remove exterior brackets when linkifying. #146 Version 1.4.1 (December 15, 2014) --------------------------------- **Changes** * Consistent order of attributes in output. * Python 3.4 support. Version 1.4 (January 12, 2014) ------------------------------ **Changes** * linkify: Update linkify to use etree type Treewalker instead of simpletree. * Updated html5lib to version ``>=0.999``. * Update all code to be compatible with Python 3 and 2 using six. * Switch to Apache License. Version 1.3 ----------- * Used by Python 3-only fork. Version 1.2.2 (May 18, 2013) ---------------------------- * Pin html5lib to version 0.95 for now due to major API break. Version 1.2.1 (February 19, 2013) --------------------------------- * ``clean()`` no longer considers ``feed:`` an acceptable protocol due to inconsistencies in browser behavior. Version 1.2 (January 28, 2013) ------------------------------ * ``linkify()`` has changed considerably. Many keyword arguments have been replaced with a single callbacks list. Please see the documentation for more information. * Bleach will no longer consider unacceptable protocols when linkifying. * ``linkify()`` now takes a tokenizer argument that allows it to skip sanitization. * ``delinkify()`` is gone. * Removed exception handling from ``_render``. ``clean()`` and ``linkify()`` may now throw. * ``linkify()`` correctly ignores case for protocols and domain names. * ``linkify()`` correctly handles markup within an tag. Version 1.1.5 ------------- Version 1.1.4 ------------- Version 1.1.3 (July 10, 2012) ----------------------------- * Fix parsing bare URLs when parse_email=True. Version 1.1.2 (June 1, 2012) ---------------------------- * Fix hang in style attribute sanitizer. (#61) * Allow ``/`` in style attribute values. Version 1.1.1 (February 17, 2012) --------------------------------- * Fix tokenizer for html5lib 0.9.5. Version 1.1.0 (October 24, 2011) -------------------------------- * ``linkify()`` now understands port numbers. (#38) * Documented character encoding behavior. (#41) * Add an optional target argument to ``linkify()``. * Add ``delinkify()`` method. (#45) * Support subdomain whitelist for ``delinkify()``. (#47, #48) Version 1.0.4 (September 2, 2011) --------------------------------- * Switch to SemVer git tags. * Make ``linkify()`` smarter about trailing punctuation. (#30) * Pass ``exc_info`` to logger during rendering issues. * Add wildcard key for attributes. (#19) * Make ``linkify()`` use the ``HTMLSanitizer`` tokenizer. (#36) * Fix URLs wrapped in parentheses. (#23) * Make ``linkify()`` UTF-8 safe. 
(#33) Version 1.0.3 (June 14, 2011) ----------------------------- * ``linkify()`` works with 3rd level domains. (#24) * ``clean()`` supports vendor prefixes in style values. (#31, #32) * Fix ``linkify()`` email escaping. Version 1.0.2 (June 6, 2011) ---------------------------- * ``linkify()`` supports email addresses. * ``clean()`` supports callables in attributes filter. Version 1.0.1 (April 12, 2011) ------------------------------ * ``linkify()`` doesn't drop trailing slashes. (#21) * ``linkify()`` won't linkify 'libgl.so.1'. (#22) bleach-2.1.2/CODE_OF_CONDUCT.rst000066400000000000000000000005621321226254300157440ustar00rootroot00000000000000Code of conduct =============== This project and repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details please see the `Mozilla Community Participation Guidelines `_ and `Developer Etiquette Guidelines `_. bleach-2.1.2/CONTRIBUTING.rst000066400000000000000000000012671321226254300154010ustar00rootroot00000000000000Reporting Bugs ============== For regular bugs, please report them `in our issue tracker `_. If you believe that you've found a security vulnerability, please `file a secure bug report in our bug tracker `_ or send an email to *security AT mozilla DOT org*. For more information on security-related bug disclosure and the PGP key to use for sending encrypted mail or to verify responses received from that address, please read our wiki page at ``_. bleach-2.1.2/CONTRIBUTORS000066400000000000000000000021141321226254300146100ustar00rootroot00000000000000Bleach was originally written and maintained by James Socol and various contributors within and without the Mozilla Corporation and Foundation. It is currently maintained by Will Kahn-Greene an Greg Guthe. Maintainers: - Will Kahn-Greene - Greg Guthe Maintainer emeritus: - Jannis Leidel - James Socol Contributors: - Adam Lofts - Adrian "ThiefMaster" - Alek - Alexandre Macabies - Alexandr N. Zamaraev - Alex Defsen - Alex Ehlke - Alireza Savand - Andreas Malecki - Andy Freeland - Anton Kovalyov - Chris Beaven - Dan Gayle - Erik Rose - Gaurav Dadhania - Geoffrey Sneddon - Greg Guthe - Istvan Albert - Jaime Irurzun - James Socol - Jannis Leidel - Janusz Kamieński - Jeff Balogh - Jonathan Vanasco - Lee, Cheon-il - Les Orchard - Lorenz Schori - Luis Nell - Marc Abramowitz - Marc DM - Mark Lee - Mark Paschal - mdxs - nikolas - Oh Jinkyun - Paul Craciunoiu - Ricky Rosario - Ryan Niemeyer - Sébastien Fievet - sedrubal - Tim Dumol - Timothy Fitz - Vitaly Volkov - Will Kahn-Greene - Zoltán - zyegfryed bleach-2.1.2/LICENSE000066400000000000000000000010711321226254300137360ustar00rootroot00000000000000Copyright (c) 2014-2017, Mozilla Foundation Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. 
bleach-2.1.2/MANIFEST.in000066400000000000000000000005161321226254300144720ustar00rootroot00000000000000include CHANGES include CONTRIBUTORS include CONTRIBUTING.rst include CODE_OF_CONDUCT.rst include requirements.txt include tox.ini include LICENSE include README.rst include docs/conf.py include docs/Makefile recursive-include docs *.rst recursive-include tests *.py *.test *.out recursive-include tests_website *.html *.py *.rst bleach-2.1.2/README.rst000066400000000000000000000072641321226254300144320ustar00rootroot00000000000000====== Bleach ====== .. image:: https://travis-ci.org/mozilla/bleach.png?branch=master :target: https://travis-ci.org/mozilla/bleach .. image:: https://badge.fury.io/py/bleach.svg :target: http://badge.fury.io/py/bleach Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes. Bleach can also linkify text safely, applying filters that Django's ``urlize`` filter cannot, and optionally setting ``rel`` attributes, even on links already in the text. Bleach is intended for sanitizing text from *untrusted* sources. If you find yourself jumping through hoops to allow your site administrators to do lots of things, you're probably outside the use cases. Either trust those users, or don't. Because it relies on html5lib_, Bleach is as good as modern browsers at dealing with weird, quirky HTML fragments. And *any* of Bleach's methods will fix unbalanced or mis-nested tags. The version on GitHub_ is the most up-to-date and contains the latest bug fixes. You can find full documentation on `ReadTheDocs`_. :Code: https://github.com/mozilla/bleach :Documentation: https://bleach.readthedocs.io/ :Issue tracker: https://github.com/mozilla/bleach/issues :IRC: ``#bleach`` on irc.mozilla.org :License: Apache License v2; see LICENSE file Reporting Bugs ============== For regular bugs, please report them `in our issue tracker `_. If you believe that you've found a security vulnerability, please `file a secure bug report in our bug tracker `_ or send an email to *security AT mozilla DOT org*. For more information on security-related bug disclosure and the PGP key to use for sending encrypted mail or to verify responses received from that address, please read our wiki page at ``_. Installing Bleach ================= Bleach is available on PyPI_, so you can install it with ``pip``:: $ pip install bleach Or with ``easy_install``:: $ easy_install bleach Upgrading Bleach ================ .. warning:: Before doing any upgrades, read through `Bleach Changes `_ for backwards incompatible changes, newer versions, etc. Basic use ========= The simplest way to use Bleach is: .. code-block:: python >>> import bleach >>> bleach.clean('an example') u'an <script>evil()</script> example' >>> bleach.linkify('an http://example.com url') u'an http://example.com url Security ======== Bleach is a security-related library. We have a responsible security vulnerability reporting process. Please use that if you're reporting a security issue. Security issues are fixed in private. After we land such a fix, we'll do a release. For every release, we mark security issues we've fixed in the ``CHANGES`` in the **Security issues** section. We include relevant CVE links. Code of conduct =============== This project and repository is governed by Mozilla's code of conduct and etiquette guidelines. For more details please see the `Mozilla Community Participation Guidelines `_ and `Developer Etiquette Guidelines `_. .. _html5lib: https://github.com/html5lib/html5lib-python .. 
_GitHub: https://github.com/mozilla/bleach .. _ReadTheDocs: https://bleach.readthedocs.io/ .. _PyPI: http://pypi.python.org/pypi/bleach bleach-2.1.2/bleach/000077500000000000000000000000001321226254300141505ustar00rootroot00000000000000bleach-2.1.2/bleach/__init__.py000066400000000000000000000102161321226254300162610ustar00rootroot00000000000000# -*- coding: utf-8 -*- from __future__ import unicode_literals import warnings from pkg_resources import parse_version from bleach.linkifier import ( DEFAULT_CALLBACKS, Linker, ) from bleach.sanitizer import ( ALLOWED_ATTRIBUTES, ALLOWED_PROTOCOLS, ALLOWED_STYLES, ALLOWED_TAGS, Cleaner, ) import html5lib try: _html5lib_version = html5lib.__version__.split('.') if len(_html5lib_version) < 2: _html5lib_version = _html5lib_version + ['0'] except Exception: _h5ml5lib_version = ['unknown', 'unknown'] # Bleach 3.0.0 won't support html5lib-python < 1.0.0. if _html5lib_version < ['1', '0'] or 'b' in _html5lib_version[1]: warnings.warn('Support for html5lib-python < 1.0.0 is deprecated.', DeprecationWarning) # yyyymmdd __releasedate__ = '20171207' # x.y.z or x.y.z.dev0 -- semver __version__ = '2.1.2' VERSION = parse_version(__version__) __all__ = ['clean', 'linkify'] def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, styles=ALLOWED_STYLES, protocols=ALLOWED_PROTOCOLS, strip=False, strip_comments=True): """Clean an HTML fragment of malicious content and return it This function is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page. This function is not designed to use to transform content to be used in non-web-page contexts. Example:: import bleach better_text = bleach.clean(yucky_text) .. Note:: If you're cleaning a lot of text and passing the same argument values or you want more configurability, consider using a :py:class:`bleach.sanitizer.Cleaner` instance. :arg str text: the text to clean :arg list tags: allowed list of tags; defaults to ``bleach.sanitizer.ALLOWED_TAGS`` :arg dict attributes: allowed attributes; can be a callable, list or dict; defaults to ``bleach.sanitizer.ALLOWED_ATTRIBUTES`` :arg list styles: allowed list of css styles; defaults to ``bleach.sanitizer.ALLOWED_STYLES`` :arg list protocols: allowed list of protocols for links; defaults to ``bleach.sanitizer.ALLOWED_PROTOCOLS`` :arg bool strip: whether or not to strip disallowed elements :arg bool strip_comments: whether or not to strip HTML comments :returns: cleaned text as unicode """ cleaner = Cleaner( tags=tags, attributes=attributes, styles=styles, protocols=protocols, strip=strip, strip_comments=strip_comments, ) return cleaner.clean(text) def linkify(text, callbacks=DEFAULT_CALLBACKS, skip_tags=None, parse_email=False): """Convert URL-like strings in an HTML fragment to links This function converts strings that look like URLs, domain names and email addresses in text that may be an HTML fragment to links, while preserving: 1. links already in the string 2. urls found in attributes 3. email addresses linkify does a best-effort approach and tries to recover from bad situations due to crazy text. .. Note:: If you're linking a lot of text and passing the same argument values or you want more configurability, consider using a :py:class:`bleach.linkifier.Linker` instance. .. Note:: If you have text that you want to clean and then linkify, consider using the :py:class:`bleach.linkifier.LinkifyFilter` as a filter in the clean pass. That way you're not parsing the HTML twice. 
:arg str text: the text to linkify :arg list callbacks: list of callbacks to run when adjusting tag attributes; defaults to ``bleach.linkifier.DEFAULT_CALLBACKS`` :arg list skip_tags: list of tags that you don't want to linkify the contents of; for example, you could set this to ``['pre']`` to skip linkifying contents of ``pre`` tags :arg bool parse_email: whether or not to linkify email addresses :returns: linkified text as unicode """ linker = Linker( callbacks=callbacks, skip_tags=skip_tags, parse_email=parse_email ) return linker.linkify(text) bleach-2.1.2/bleach/callbacks.py000066400000000000000000000014441321226254300164440ustar00rootroot00000000000000"""A set of basic callbacks for bleach.linkify.""" from __future__ import unicode_literals def nofollow(attrs, new=False): href_key = (None, u'href') if href_key not in attrs: return attrs if attrs[href_key].startswith(u'mailto:'): return attrs rel_key = (None, u'rel') rel_values = [val for val in attrs.get(rel_key, u'').split(u' ') if val] if u'nofollow' not in [rel_val.lower() for rel_val in rel_values]: rel_values.append(u'nofollow') attrs[rel_key] = u' '.join(rel_values) return attrs def target_blank(attrs, new=False): href_key = (None, u'href') if href_key not in attrs: return attrs if attrs[href_key].startswith(u'mailto:'): return attrs attrs[(None, u'target')] = u'_blank' return attrs bleach-2.1.2/bleach/linkifier.py000066400000000000000000000455411321226254300165070ustar00rootroot00000000000000from __future__ import unicode_literals import re import six import html5lib from html5lib.filters.base import Filter from html5lib.filters.sanitizer import allowed_protocols from html5lib.serializer import HTMLSerializer from bleach import callbacks as linkify_callbacks from bleach.utils import alphabetize_attributes, force_unicode #: List of default callbacks DEFAULT_CALLBACKS = [linkify_callbacks.nofollow] TLDS = """ac ad ae aero af ag ai al am an ao aq ar arpa as asia at au aw ax az ba bb bd be bf bg bh bi biz bj bm bn bo br bs bt bv bw by bz ca cat cc cd cf cg ch ci ck cl cm cn co com coop cr cu cv cx cy cz de dj dk dm do dz ec edu ee eg er es et eu fi fj fk fm fo fr ga gb gd ge gf gg gh gi gl gm gn gov gp gq gr gs gt gu gw gy hk hm hn hr ht hu id ie il im in info int io iq ir is it je jm jo jobs jp ke kg kh ki km kn kp kr kw ky kz la lb lc li lk lr ls lt lu lv ly ma mc md me mg mh mil mk ml mm mn mo mobi mp mq mr ms mt mu museum mv mw mx my mz na name nc ne net nf ng ni nl no np nr nu nz om org pa pe pf pg ph pk pl pm pn post pr pro ps pt pw py qa re ro rs ru rw sa sb sc sd se sg sh si sj sk sl sm sn so sr ss st su sv sx sy sz tc td tel tf tg th tj tk tl tm tn to tp tr travel tt tv tw tz ua ug uk us uy uz va vc ve vg vi vn vu wf ws xn xxx ye yt yu za zm zw""".split() # Make sure that .com doesn't get matched by .co first TLDS.reverse() def build_url_re(tlds=TLDS, protocols=allowed_protocols): """Builds the url regex used by linkifier If you want a different set of tlds or allowed protocols, pass those in and stomp on the existing ``url_re``:: from bleach import linkifier my_url_re = linkifier.build_url_re(my_tlds_list, my_protocols) linker = LinkifyFilter(url_re=my_url_re) """ return re.compile( r"""\(* # Match any opening parentheses. \b(?"]*)? 
# /path/zz (excluding "unsafe" chars from RFC 1738, # except for # and ~, which happen in practice) """.format('|'.join(protocols), '|'.join(tlds)), re.IGNORECASE | re.VERBOSE | re.UNICODE) URL_RE = build_url_re() PROTO_RE = re.compile(r'^[\w-]+:/{0,3}', re.IGNORECASE) EMAIL_RE = re.compile( r"""(? ``value`` :arg bool is_new: whether or not this link was added by linkify :returns: adjusted attrs dict or ``None`` """ for cb in self.callbacks: attrs = cb(attrs, is_new) if attrs is None: return None return attrs def extract_character_data(self, token_list): """Extracts and squashes character sequences in a token stream""" # FIXME(willkg): This is a terrible idea. What it does is drop all the # tags from the token list and merge the Characters and SpaceCharacters # tokens into a single text. # # So something like this:: # # "" "" "some text" "" "" # # gets converted to "some text". # # This gets used to figure out the ``_text`` fauxttribute value for # linkify callables. # # I'm not really sure how else to support that ``_text`` fauxttribute and # maintain some modicum of backwards compatability with previous versions # of Bleach. out = [] for token in token_list: token_type = token['type'] if token_type in ['Characters', 'SpaceCharacters']: out.append(token['data']) return u''.join(out) def handle_email_addresses(self, src_iter): """Handle email addresses in character tokens""" for token in src_iter: if token['type'] == 'Characters': text = token['data'] new_tokens = [] end = 0 # For each email address we find in the text for match in self.email_re.finditer(text): if match.start() > end: new_tokens.append( {u'type': u'Characters', u'data': text[end:match.start()]} ) # Run attributes through the callbacks to see what we # should do with this match attrs = { (None, u'href'): u'mailto:%s' % match.group(0), u'_text': match.group(0) } attrs = self.apply_callbacks(attrs, True) if attrs is None: # Just add the text--but not as a link new_tokens.append( {u'type': u'Characters', u'data': match.group(0)} ) else: # Add an "a" tag for the new link _text = attrs.pop(u'_text', '') attrs = alphabetize_attributes(attrs) new_tokens.extend([ {u'type': u'StartTag', u'name': u'a', u'data': attrs}, {u'type': u'Characters', u'data': force_unicode(_text)}, {u'type': u'EndTag', u'name': 'a'} ]) end = match.end() if new_tokens: # Yield the adjusted set of tokens and then continue # through the loop if end < len(text): new_tokens.append({u'type': u'Characters', u'data': text[end:]}) for new_token in new_tokens: yield new_token continue yield token def strip_non_url_bits(self, fragment): """Strips non-url bits from the url This accounts for over-eager matching by the regex. """ prefix = suffix = '' while fragment: # Try removing ( from the beginning and, if it's balanced, from the # end, too if fragment.startswith(u'('): prefix = prefix + u'(' fragment = fragment[1:] if fragment.endswith(u')'): suffix = u')' + suffix fragment = fragment[:-1] continue # Now try extraneous things from the end. For example, sometimes we # pick up ) at the end of a url, but the url is in a parenthesized # phrase like: # # "i looked at the site (at http://example.com)" if fragment.endswith(u')') and u'(' not in fragment: fragment = fragment[:-1] suffix = u')' + suffix continue # Handle commas if fragment.endswith(u','): fragment = fragment[:-1] suffix = u',' + suffix continue # Handle periods if fragment.endswith(u'.'): fragment = fragment[:-1] suffix = u'.' 
+ suffix continue # Nothing matched, so we're done break return fragment, prefix, suffix def handle_links(self, src_iter): """Handle links in character tokens""" in_a = False # happens, if parse_email=True and if a mail was found for token in src_iter: if in_a: if token['type'] == 'EndTag' and token['name'] == 'a': in_a = False yield token continue elif token['type'] == 'StartTag' and token['name'] == 'a': in_a = True yield token continue if token['type'] == 'Characters': text = token['data'] new_tokens = [] end = 0 for match in self.url_re.finditer(text): if match.start() > end: new_tokens.append( {u'type': u'Characters', u'data': text[end:match.start()]} ) url = match.group(0) prefix = suffix = '' # Sometimes we pick up too much in the url match, so look for # bits we should drop and remove them from the match url, prefix, suffix = self.strip_non_url_bits(url) # If there's no protocol, add one if PROTO_RE.search(url): href = url else: href = u'http://%s' % url attrs = { (None, u'href'): href, u'_text': url } attrs = self.apply_callbacks(attrs, True) if attrs is None: # Just add the text new_tokens.append( {u'type': u'Characters', u'data': prefix + url + suffix} ) else: # Add the "a" tag! if prefix: new_tokens.append( {u'type': u'Characters', u'data': prefix} ) _text = attrs.pop(u'_text', '') attrs = alphabetize_attributes(attrs) new_tokens.extend([ {u'type': u'StartTag', u'name': u'a', u'data': attrs}, {u'type': u'Characters', u'data': force_unicode(_text)}, {u'type': u'EndTag', u'name': 'a'}, ]) if suffix: new_tokens.append( {u'type': u'Characters', u'data': suffix} ) end = match.end() if new_tokens: # Yield the adjusted set of tokens and then continue # through the loop if end < len(text): new_tokens.append({u'type': u'Characters', u'data': text[end:]}) for new_token in new_tokens: yield new_token continue yield token def handle_a_tag(self, token_buffer): """Handle the "a" tag This could adjust the link or drop it altogether depending on what the callbacks return. This yields the new set of tokens. """ a_token = token_buffer[0] if a_token['data']: attrs = a_token['data'] else: attrs = {} text = self.extract_character_data(token_buffer) attrs['_text'] = text attrs = self.apply_callbacks(attrs, False) if attrs is None: # We're dropping the "a" tag and everything else and replacing # it with character data. So emit that token. yield {'type': 'Characters', 'data': text} else: new_text = attrs.pop('_text', '') a_token['data'] = alphabetize_attributes(attrs) if text == new_text: # The callbacks didn't change the text, so we yield the new "a" # token, then whatever else was there, then the end "a" token yield a_token for mem in token_buffer[1:]: yield mem else: # If the callbacks changed the text, then we're going to drop # all the tokens between the start and end "a" tags and replace # it with the new text yield a_token yield {'type': 'Characters', 'data': force_unicode(new_text)} yield token_buffer[-1] def __iter__(self): in_a = False in_skip_tag = None token_buffer = [] for token in super(LinkifyFilter, self).__iter__(): if in_a: # Handle the case where we're in an "a" tag--we want to buffer tokens # until we hit an end "a" tag. 
if token['type'] == 'EndTag' and token['name'] == 'a': # Add the end tag to the token buffer and then handle them # and yield anything returned token_buffer.append(token) for new_token in self.handle_a_tag(token_buffer): yield new_token # Clear "a" related state and continue since we've yielded all # the tokens we're going to yield in_a = False token_buffer = [] continue else: token_buffer.append(token) continue elif token['type'] in ['StartTag', 'EmptyTag']: if token['name'] in self.skip_tags: # Skip tags start a "special mode" where we don't linkify # anything until the end tag. in_skip_tag = token['name'] elif token['name'] == 'a': # The "a" tag is special--we switch to a slurp mode and # slurp all the tokens until the end "a" tag and then # figure out what to do with them there. in_a = True token_buffer.append(token) # We buffer the start tag, so we don't want to yield it, # yet continue elif in_skip_tag and self.skip_tags: # NOTE(willkg): We put this clause here since in_a and # switching in and out of in_a takes precedence. if token['type'] == 'EndTag' and token['name'] == in_skip_tag: in_skip_tag = None elif not in_a and not in_skip_tag and token['type'] == 'Characters': new_stream = iter([token]) if self.parse_email: new_stream = self.handle_email_addresses(new_stream) new_stream = self.handle_links(new_stream) for token in new_stream: yield token # We've already yielded this token, so continue continue yield token bleach-2.1.2/bleach/sanitizer.py000066400000000000000000000522471321226254300165440ustar00rootroot00000000000000from __future__ import unicode_literals from itertools import chain import re import string import six from xml.sax.saxutils import unescape import html5lib from html5lib.constants import ( entities, namespaces, prefixes, tokenTypes, ) try: from html5lib.constants import ReparseException except ImportError: # html5lib-python 1.0 changed the name from html5lib.constants import _ReparseException as ReparseException from html5lib.filters.base import Filter from html5lib.filters import sanitizer from html5lib.serializer import HTMLSerializer from html5lib._tokenizer import HTMLTokenizer from html5lib._trie import Trie from bleach.utils import alphabetize_attributes, force_unicode #: Trie of html entity string -> character representation ENTITIES_TRIE = Trie(entities) #: List of allowed tags ALLOWED_TAGS = [ 'a', 'abbr', 'acronym', 'b', 'blockquote', 'code', 'em', 'i', 'li', 'ol', 'strong', 'ul', ] #: Map of allowed attributes by tag ALLOWED_ATTRIBUTES = { 'a': ['href', 'title'], 'abbr': ['title'], 'acronym': ['title'], } #: List of allowed styles ALLOWED_STYLES = [] #: List of allowed protocols ALLOWED_PROTOCOLS = ['http', 'https', 'mailto'] AMP_SPLIT_RE = re.compile('(&)') #: Invisible characters--0 to and including 31 except 9 (tab), 10 (lf), and 13 (cr) INVISIBLE_CHARACTERS = ''.join([chr(c) for c in chain(range(0, 9), range(11, 13), range(14, 32))]) #: Regexp for characters that are invisible INVISIBLE_CHARACTERS_RE = re.compile( '[' + INVISIBLE_CHARACTERS + ']', re.UNICODE ) #: String to replace invisible characters with. This can be a character, a #: string, or even a function that takes a Python re matchobj INVISIBLE_REPLACEMENT_CHAR = '?' class BleachHTMLTokenizer(HTMLTokenizer): def consumeEntity(self, allowedChar=None, fromAttribute=False): # We don't want to consume and convert entities, so this overrides the # html5lib tokenizer's consumeEntity so that it's now a no-op. 
# # However, when that gets called, it's consumed an &, so we put that in # the steam. if fromAttribute: self.currentToken['data'][-1][1] += '&' else: self.tokenQueue.append({"type": tokenTypes['Characters'], "data": '&'}) class BleachHTMLParser(html5lib.HTMLParser): def _parse(self, stream, innerHTML=False, container="div", scripting=False, **kwargs): # Override HTMLParser so we can swap out the tokenizer for our own. self.innerHTMLMode = innerHTML self.container = container self.scripting = scripting self.tokenizer = BleachHTMLTokenizer(stream, parser=self, **kwargs) self.reset() try: self.mainLoop() except ReparseException: self.reset() self.mainLoop() class Cleaner(object): """Cleaner for cleaning HTML fragments of malicious content This cleaner is a security-focused function whose sole purpose is to remove malicious content from a string such that it can be displayed as content in a web page. This cleaner is not designed to use to transform content to be used in non-web-page contexts. To use:: from bleach.sanitizer import Cleaner cleaner = Cleaner() for text in all_the_yucky_things: sanitized = cleaner.clean(text) """ def __init__(self, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, styles=ALLOWED_STYLES, protocols=ALLOWED_PROTOCOLS, strip=False, strip_comments=True, filters=None): """Initializes a Cleaner :arg list tags: allowed list of tags; defaults to ``bleach.sanitizer.ALLOWED_TAGS`` :arg dict attributes: allowed attributes; can be a callable, list or dict; defaults to ``bleach.sanitizer.ALLOWED_ATTRIBUTES`` :arg list styles: allowed list of css styles; defaults to ``bleach.sanitizer.ALLOWED_STYLES`` :arg list protocols: allowed list of protocols for links; defaults to ``bleach.sanitizer.ALLOWED_PROTOCOLS`` :arg bool strip: whether or not to strip disallowed elements :arg bool strip_comments: whether or not to strip HTML comments :arg list filters: list of html5lib Filter classes to pass streamed content through .. seealso:: http://html5lib.readthedocs.io/en/latest/movingparts.html#filters .. Warning:: Using filters changes the output of ``bleach.Cleaner.clean``. Make sure the way the filters change the output are secure. 
""" self.tags = tags self.attributes = attributes self.styles = styles self.protocols = protocols self.strip = strip self.strip_comments = strip_comments self.filters = filters or [] self.parser = BleachHTMLParser(namespaceHTMLElements=False) self.walker = html5lib.getTreeWalker('etree') self.serializer = BleachHTMLSerializer( quote_attr_values='always', omit_optional_tags=False, escape_lt_in_attrs=True, # We want to leave entities as they are without escaping or # resolving or expanding resolve_entities=False, # Bleach has its own sanitizer, so don't use the html5lib one sanitize=False, # Bleach sanitizer alphabetizes already, so don't use the html5lib one alphabetical_attributes=False, ) def clean(self, text): """Cleans text and returns sanitized result as unicode :arg str text: text to be cleaned :returns: sanitized text as unicode :raises TypeError: if ``text`` is not a text type """ if not isinstance(text, six.string_types): message = "argument cannot be of '{name}' type, must be of text type".format( name=text.__class__.__name__) raise TypeError(message) if not text: return u'' text = force_unicode(text) dom = self.parser.parseFragment(text) filtered = BleachSanitizerFilter( source=self.walker(dom), # Bleach-sanitizer-specific things attributes=self.attributes, strip_disallowed_elements=self.strip, strip_html_comments=self.strip_comments, # html5lib-sanitizer things allowed_elements=self.tags, allowed_css_properties=self.styles, allowed_protocols=self.protocols, allowed_svg_properties=[], ) # Apply any filters after the BleachSanitizerFilter for filter_class in self.filters: filtered = filter_class(source=filtered) return self.serializer.render(filtered) def attribute_filter_factory(attributes): """Generates attribute filter function for the given attributes value The attributes value can take one of several shapes. This returns a filter function appropriate to the attributes value. One nice thing about this is that there's less if/then shenanigans in the ``allow_token`` method. """ if callable(attributes): return attributes if isinstance(attributes, dict): def _attr_filter(tag, attr, value): if tag in attributes: attr_val = attributes[tag] if callable(attr_val): return attr_val(tag, attr, value) if attr in attr_val: return True if '*' in attributes: attr_val = attributes['*'] if callable(attr_val): return attr_val(tag, attr, value) return attr in attr_val return False return _attr_filter if isinstance(attributes, list): def _attr_filter(tag, attr, value): return attr in attributes return _attr_filter raise ValueError('attributes needs to be a callable, a list or a dict') def match_entity(stream): """Returns first entity in stream or None if no entity exists Note: For Bleach purposes, entities must start with a "&" and end with a ";". :arg stream: the character stream :returns: ``None`` or the entity string without "&" or ";" """ # Nix the & at the beginning if stream[0] != '&': raise ValueError('Stream should begin with "&"') stream = stream[1:] stream = list(stream) possible_entity = '' end_characters = '<&=;' + string.whitespace # Handle number entities if stream and stream[0] == '#': possible_entity = '#' stream.pop(0) if stream and stream[0] in ('x', 'X'): allowed = '0123456789abcdefABCDEF' possible_entity += stream.pop(0) else: allowed = '0123456789' # FIXME(willkg): Do we want to make sure these are valid number # entities? This doesn't do that currently. 
while stream and stream[0] not in end_characters: c = stream.pop(0) if c not in allowed: break possible_entity += c if possible_entity and stream and stream[0] == ';': return possible_entity return None # Handle character entities while stream and stream[0] not in end_characters: c = stream.pop(0) if not ENTITIES_TRIE.has_keys_with_prefix(possible_entity): break possible_entity += c if possible_entity and stream and stream[0] == ';': return possible_entity return None def next_possible_entity(text): """Takes a text and generates a list of possible entities :arg text: the text to look at :returns: generator where each part (except the first) starts with an "&" """ for i, part in enumerate(AMP_SPLIT_RE.split(text)): if i == 0: yield part elif i % 2 == 0: yield '&' + part class BleachSanitizerFilter(sanitizer.Filter): """html5lib Filter that sanitizes text This filter can be used anywhere html5lib filters can be used. """ def __init__(self, source, attributes=ALLOWED_ATTRIBUTES, strip_disallowed_elements=False, strip_html_comments=True, **kwargs): """Creates a BleachSanitizerFilter instance :arg Treewalker source: stream :arg list tags: allowed list of tags; defaults to ``bleach.sanitizer.ALLOWED_TAGS`` :arg dict attributes: allowed attributes; can be a callable, list or dict; defaults to ``bleach.sanitizer.ALLOWED_ATTRIBUTES`` :arg list styles: allowed list of css styles; defaults to ``bleach.sanitizer.ALLOWED_STYLES`` :arg list protocols: allowed list of protocols for links; defaults to ``bleach.sanitizer.ALLOWED_PROTOCOLS`` :arg bool strip_disallowed_elements: whether or not to strip disallowed elements :arg bool strip_html_comments: whether or not to strip HTML comments """ self.attr_filter = attribute_filter_factory(attributes) self.strip_disallowed_elements = strip_disallowed_elements self.strip_html_comments = strip_html_comments return super(BleachSanitizerFilter, self).__init__(source, **kwargs) def __iter__(self): for token in Filter.__iter__(self): ret = self.sanitize_token(token) if not ret: continue if isinstance(ret, list): for subtoken in ret: yield subtoken else: yield ret def sanitize_token(self, token): """Sanitize a token either by HTML-encoding or dropping. Unlike sanitizer.Filter, allowed_attributes can be a dict of {'tag': ['attribute', 'pairs'], 'tag': callable}. Here callable is a function with two arguments of attribute name and value. It should return true of false. Also gives the option to strip tags instead of encoding. :arg dict token: token to sanitize :returns: token or list of tokens """ token_type = token['type'] if token_type in ['StartTag', 'EndTag', 'EmptyTag']: if token['name'] in self.allowed_elements: return self.allow_token(token) elif self.strip_disallowed_elements: return None else: if 'data' in token: # Alphabetize the attributes before calling .disallowed_token() # so that the resulting string is stable token['data'] = alphabetize_attributes(token['data']) return self.disallowed_token(token) elif token_type == 'Comment': if not self.strip_html_comments: return token else: return None elif token_type == 'Characters': return self.sanitize_characters(token) else: return token def sanitize_characters(self, token): """Handles Characters tokens Our overridden tokenizer doesn't do anything with entities. However, that means that the serializer will convert all ``&`` in Characters tokens to ``&``. Since we don't want that, we extract entities here and convert them to Entity tokens so the serializer will let them be. 
:arg token: the Characters token to work on :returns: a list of tokens """ data = token.get('data', '') if not data: return token data = INVISIBLE_CHARACTERS_RE.sub(INVISIBLE_REPLACEMENT_CHAR, data) token['data'] = data # If there isn't a & in the data, we can return now if '&' not in data: return token new_tokens = [] # For each possible entity that starts with a "&", we try to extract an # actual entity and re-tokenize accordingly for part in next_possible_entity(data): if not part: continue if part.startswith('&'): entity = match_entity(part) if entity is not None: new_tokens.append({'type': 'Entity', 'name': entity}) # Length of the entity plus 2--one for & at the beginning # and and one for ; at the end part = part[len(entity) + 2:] if part: new_tokens.append({'type': 'Characters', 'data': part}) continue new_tokens.append({'type': 'Characters', 'data': part}) return new_tokens def allow_token(self, token): """Handles the case where we're allowing the tag""" if 'data' in token: # Loop through all the attributes and drop the ones that are not # allowed, are unsafe or break other rules. Additionally, fix # attribute values that need fixing. # # At the end of this loop, we have the final set of attributes # we're keeping. attrs = {} for namespaced_name, val in token['data'].items(): namespace, name = namespaced_name # Drop attributes that are not explicitly allowed # # NOTE(willkg): We pass in the attribute name--not a namespaced # name. if not self.attr_filter(token['name'], name, val): continue # Look at attributes that have uri values if namespaced_name in self.attr_val_is_uri: val_unescaped = re.sub( "[`\000-\040\177-\240\s]+", '', unescape(val)).lower() # Remove replacement characters from unescaped characters. val_unescaped = val_unescaped.replace("\ufffd", "") # Drop attributes with uri values that have protocols that # aren't allowed if (re.match(r'^[a-z0-9][-+.a-z0-9]*:', val_unescaped) and (val_unescaped.split(':')[0] not in self.allowed_protocols)): continue # Drop values in svg attrs with non-local IRIs if namespaced_name in self.svg_attr_val_allows_ref: new_val = re.sub(r'url\s*\(\s*[^#\s][^)]+?\)', ' ', unescape(val)) new_val = new_val.strip() if not new_val: continue else: # Replace the val with the unescaped version because # it's a iri val = new_val # Drop href and xlink:href attr for svg elements with non-local IRIs if (None, token['name']) in self.svg_allow_local_href: if namespaced_name in [(None, 'href'), (namespaces['xlink'], 'href')]: if re.search(r'^\s*[^#\s]', val): continue # If it's a style attribute, sanitize it if namespaced_name == (None, u'style'): val = self.sanitize_css(val) # At this point, we want to keep the attribute, so add it in attrs[namespaced_name] = val token['data'] = alphabetize_attributes(attrs) return token def disallowed_token(self, token): token_type = token["type"] if token_type == "EndTag": token["data"] = "" % token["name"] elif token["data"]: assert token_type in ("StartTag", "EmptyTag") attrs = [] for (ns, name), v in token["data"].items(): attrs.append(' %s="%s"' % ( name if ns is None else "%s:%s" % (prefixes[ns], name), # NOTE(willkg): HTMLSerializer escapes attribute values # already, so if we do it here (like HTMLSerializer does), # then we end up double-escaping. 
v) ) token["data"] = "<%s%s>" % (token["name"], ''.join(attrs)) else: token["data"] = "<%s>" % token["name"] if token.get("selfClosing"): token["data"] = token["data"][:-1] + "/>" token["type"] = "Characters" del token["name"] return token def sanitize_css(self, style): """Sanitizes css in style tags""" # disallow urls style = re.compile('url\s*\(\s*[^\s)]+?\s*\)\s*').sub(' ', style) # gauntlet # Validate the css in the style tag and if it's not valid, then drop # the whole thing. parts = style.split(';') gauntlet = re.compile( r"""^([-/:,#%.'"\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'\s*|"[\s\w]+"|\([\d,%\.\s]+\))*$""" ) for part in parts: if not gauntlet.match(part): return '' if not re.match("^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$", style): return '' clean = [] for prop, value in re.findall('([-\w]+)\s*:\s*([^:;]*)', style): if not value: continue if prop.lower() in self.allowed_css_properties: clean.append(prop + ': ' + value + ';') elif prop.lower() in self.allowed_svg_properties: clean.append(prop + ': ' + value + ';') return ' '.join(clean) class BleachHTMLSerializer(HTMLSerializer): """Wraps the HTMLSerializer and undoes & -> & in attributes""" def escape_base_amp(self, stoken): """Escapes bare & in HTML attribute values""" # First, undo what the HTMLSerializer did stoken = stoken.replace('&', '&') # Then, escape any bare & for part in next_possible_entity(stoken): if not part: continue if part.startswith('&'): entity = match_entity(part) if entity is not None: yield '&' + entity + ';' # Length of the entity plus 2--one for & at the beginning # and and one for ; at the end part = part[len(entity) + 2:] if part: yield part continue yield part.replace('&', '&') def serialize(self, treewalker, encoding=None): """Wrap HTMLSerializer.serialize and escape bare & in attributes""" in_tag = False after_equals = False for stoken in super(BleachHTMLSerializer, self).serialize(treewalker, encoding): if in_tag: if stoken == '>': in_tag = False elif after_equals: if stoken != '"': for part in self.escape_base_amp(stoken): yield part after_equals = False continue elif stoken == '=': after_equals = True yield stoken else: if stoken.startswith('<'): in_tag = True yield stoken bleach-2.1.2/bleach/utils.py000066400000000000000000000021331321226254300156610ustar00rootroot00000000000000from collections import OrderedDict import six def _attr_key(attr): """Returns appropriate key for sorting attribute names Attribute names are a tuple of ``(namespace, name)`` where namespace can be ``None`` or a string. These can't be compared in Python 3, so we conver the ``None`` to an empty string. 
""" key = (attr[0][0] or ''), attr[0][1] return key def alphabetize_attributes(attrs): """Takes a dict of attributes (or None) and returns them alphabetized""" if not attrs: return attrs return OrderedDict( [(k, v) for k, v in sorted(attrs.items(), key=_attr_key)] ) def force_unicode(text): """Takes a text (Python 2: str/unicode; Python 3: unicode) and converts to unicode :arg str/unicode text: the text in question :returns: text as unicode :raises UnicodeDecodeError: if the text was a Python 2 str and isn't in utf-8 """ # If it's already unicode, then return it if isinstance(text, six.text_type): return text # If not, convert it return six.text_type(text, 'utf-8', 'strict') bleach-2.1.2/docs/000077500000000000000000000000001321226254300136625ustar00rootroot00000000000000bleach-2.1.2/docs/Makefile000066400000000000000000000126741321226254300153340ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: -rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." 
qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/Bleach.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/Bleach.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/Bleach" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/Bleach" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." bleach-2.1.2/docs/changes.rst000066400000000000000000000000561321226254300160250ustar00rootroot00000000000000.. _changes-chapter: .. include:: ../CHANGES bleach-2.1.2/docs/clean.rst000066400000000000000000000240031321226254300154750ustar00rootroot00000000000000.. _clean-chapter: .. highlightlang:: python ========================= Sanitizing text fragments ========================= :py:func:`bleach.clean` is Bleach's HTML sanitization method. Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing algorithm and sanitize any disallowed tags or attributes. This algorithm also takes care of things like unclosed and (some) misnested tags. You may pass in a ``string`` or a ``unicode`` object, but Bleach will always return ``unicode``. .. Note:: :py:func:`bleach.clean` is for sanitizing HTML **fragments** and not entire HTML documents. .. 
Warning:: :py:func:`bleach.clean` is for sanitizing HTML fragments to use in an HTML context--not for HTML attributes, CSS, JSON, xhtml, SVG, or other contexts. For example, this is a safe use of ``clean`` output in an HTML context::

    <p>
      {{ bleach.clean(user_bio) }}
    </p>
This is a **not safe** use of ``clean`` output in an HTML attribute::

    <body data-bio="{{ bleach.clean(user_bio) }}">

If you need to use the output of ``bleach.clean()`` in an HTML attribute, you need to pass it through your template library's escape function. For example, Jinja2's ``escape`` or ``django.utils.html.escape`` or something like that. If you need to use the output of ``bleach.clean()`` in any other context, you need to pass it through an appropriate sanitizer/escaper for that context. .. autofunction:: bleach.clean Allowed tags (``tags``) ======================= The ``tags`` kwarg specifies the allowed set of HTML tags. It should be a list, tuple, or other iterable. Any HTML tags not in this list will be escaped or stripped from the text. For example: .. doctest:: >>> import bleach >>> bleach.clean( ... u'<i>an example</i>', ... tags=['b'], ... ) u'&lt;i&gt;an example&lt;/i&gt;' The default value is a relatively conservative list found in ``bleach.sanitizer.ALLOWED_TAGS``. .. autodata:: bleach.sanitizer.ALLOWED_TAGS Allowed attributes (``attributes``) =================================== The ``attributes`` kwarg lets you specify which attributes are allowed. The value can be a list, a callable or a map of tag name to list or callable. The default value is also a conservative dict found in ``bleach.sanitizer.ALLOWED_ATTRIBUTES``. .. autodata:: bleach.sanitizer.ALLOWED_ATTRIBUTES .. versionchanged:: 2.0 Prior to 2.0, the ``attributes`` kwarg value could only be a list or a map. As a list --------- The ``attributes`` value can be a list which specifies the list of attributes allowed for any tag. For example: .. doctest:: >>> import bleach >>> bleach.clean( ... u'

blah blah blah

', ... tags=['p'], ... attributes=['style'], ... styles=['color'], ... ) u'

blah blah blah

' As a dict --------- The ``attributes`` value can be a dict which maps tags to what attributes they can have. You can also specify ``*``, which will match any tag. For example, this allows "href" and "rel" for "a" tags, "alt" for the "img" tag and "class" for any tag (including "a" and "img"): .. doctest:: >>> import bleach >>> attrs = { ... '*': ['class'], ... 'a': ['href', 'rel'], ... 'img': ['alt'], ... } >>> bleach.clean( ... u'an example', ... tags=['img'], ... attributes=attrs ... ) u'an example' Using functions --------------- You can also use callables that take the tag, attribute name and attribute value and returns ``True`` to keep the attribute or ``False`` to drop it. You can pass a callable as the attributes argument value and it'll run for every tag/attr. For example: .. doctest:: >>> import bleach >>> def allow_h(tag, name, value): ... return name[0] == 'h' >>> bleach.clean( ... u'link', ... tags=['a'], ... attributes=allow_h, ... ) u'link' You can also pass a callable as a value in an attributes dict and it'll run for attributes for specified tags: .. doctest:: >>> from urlparse import urlparse >>> import bleach >>> def allow_src(tag, name, value): ... if name in ('alt', 'height', 'width'): ... return True ... if name == 'src': ... p = urlparse(value) ... return (not p.netloc) or p.netloc == 'mydomain.com' ... return False >>> bleach.clean( ... u'an example', ... tags=['img'], ... attributes={ ... 'img': allow_src ... } ... ) u'an example' .. versionchanged:: 2.0 In previous versions of Bleach, the callable took an attribute name and a attribute value. Now it takes a tag, an attribute name and an attribute value. Allowed styles (``styles``) =========================== If you allow the ``style`` attribute, you will also need to specify the allowed styles users are allowed to set, for example ``color`` and ``background-color``. The default value is an empty list. In other words, the ``style`` attribute will be allowed but no style declaration names will be allowed. For example, to allow users to set the color and font-weight of text: .. doctest:: >>> import bleach >>> tags = ['p', 'em', 'strong'] >>> attrs = { ... '*': ['style'] ... } >>> styles = ['color', 'font-weight'] >>> bleach.clean( ... u'

my html

', ... tags=tags, ... attributes=attrs, ... styles=styles ... ) u'

my html

' Default styles are stored in ``bleach.sanitizer.ALLOWED_STYLES``. .. autodata:: bleach.sanitizer.ALLOWED_STYLES Allowed protocols (``protocols``) ================================= If you allow tags that have attributes containing a URI value (like the ``href`` attribute of an anchor tag, you may want to adapt the accepted protocols. For example, this sets allowed protocols to http, https and smb: .. doctest:: >>> import bleach >>> bleach.clean( ... 'allowed protocol', ... protocols=['http', 'https', 'smb'] ... ) u'allowed protocol' This adds smb to the Bleach-specified set of allowed protocols: .. doctest:: >>> import bleach >>> bleach.clean( ... 'allowed protocol', ... protocols=bleach.ALLOWED_PROTOCOLS + ['smb'] ... ) u'allowed protocol' Default protocols are in ``bleach.sanitizer.ALLOWED_PROTOCOLS``. .. autodata:: bleach.sanitizer.ALLOWED_PROTOCOLS Stripping markup (``strip``) ============================ By default, Bleach *escapes* tags that aren't specified in the allowed tags list and invalid markup. For example: .. doctest:: >>> import bleach >>> bleach.clean('is not allowed') u'<span>is not allowed</span>' >>> bleach.clean('is not allowed', tags=['b']) u'<span>is not allowed</span>' If you would rather Bleach stripped this markup entirely, you can pass ``strip=True``: .. doctest:: >>> import bleach >>> bleach.clean('is not allowed', strip=True) u'is not allowed' >>> bleach.clean('is not allowed', tags=['b'], strip=True) u'is not allowed' Stripping comments (``strip_comments``) ======================================= By default, Bleach will strip out HTML comments. To disable this behavior, set ``strip_comments=False``: .. doctest:: >>> import bleach >>> html = 'my html' >>> bleach.clean(html) u'my html' >>> bleach.clean(html, strip_comments=False) u'my html' Using ``bleach.sanitizer.Cleaner`` ================================== If you're cleaning a lot of text or you need better control of things, you should create a :py:class:`bleach.sanitizer.Cleaner` instance. .. autoclass:: bleach.sanitizer.Cleaner :members: .. versionadded:: 2.0 html5lib Filters (``filters``) ------------------------------ Bleach sanitizing is implemented as an html5lib filter. The consequence of this is that we can pass the streamed content through additional specified filters after the :py:class:`bleach.sanitizer.BleachSanitizingFilter` filter has run. This lets you add data, drop data and change data as it is being serialized back to a unicode. Documentation on html5lib Filters is here: http://html5lib.readthedocs.io/en/latest/movingparts.html#filters Trivial Filter example: .. doctest:: >>> from bleach.sanitizer import Cleaner >>> from html5lib.filters.base import Filter >>> class MooFilter(Filter): ... def __iter__(self): ... for token in Filter.__iter__(self): ... if token['type'] in ['StartTag', 'EmptyTag'] and token['data']: ... for attr, value in token['data'].items(): ... token['data'][attr] = 'moo' ... yield token ... >>> ATTRS = { ... 'img': ['rel', 'src'] ... } ... >>> TAGS = ['img'] >>> cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter]) >>> dirty = 'this is cute! ' >>> cleaner.clean(dirty) u'this is cute! ' .. Warning:: Filters change the output of cleaning. Make sure that whatever changes the filter is applying maintain the safety guarantees of the output. .. 
versionadded:: 2.0 Using ``bleach.sanitizer.BleachSanitizerFilter`` ================================================ ``bleach.clean`` creates a ``bleach.sanitizer.Cleaner`` which creates a ``bleach.sanitizer.BleachSanitizerFilter`` which does the sanitizing work. ``BleachSanitizerFilter`` is an html5lib filter and can be used anywhere you can use an html5lib filter. .. autoclass:: bleach.sanitizer.BleachSanitizerFilter .. versionadded:: 2.0 bleach-2.1.2/docs/conf.py000066400000000000000000000174001321226254300151630ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # Bleach documentation build configuration file, created by # sphinx-quickstart on Fri May 11 21:11:39 2012. # # This file is execfile()d with the current directory set to its containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import os import sys sys.path.insert(0, os.path.abspath('..')) import bleach # noqa # -- General configuration ----------------------------------------------------- # If your documentation needs a minimal Sphinx version, state it here. # needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be extensions # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = ['sphinx.ext.autodoc', 'sphinx.ext.viewcode', 'sphinx.ext.doctest'] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. # source_encoding = 'utf-8-sig' # The master toctree document. master_doc = 'index' # General information about the project. project = u'Bleach' copyright = u'2012-2015, James Socol; 2015-2017, Mozilla Foundation' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = bleach.__version__ # The full version, including alpha/beta/rc tags. release = bleach.__version__ + ' ' + bleach.__releasedate__ # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: # today = '' # Else, today_fmt is used as the format for a strftime call. # today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = ['_build'] # The reST default role (used for this markup: `text`) to use for all documents. # default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. # add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). # add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. # show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. 
# modindex_common_prefix = [] # -- Options for autodoc ----------- # Display the class docstring and __init__ docstring concatenated autoclass_content = 'both' # -- Options for HTML output --------------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. html_theme = 'alabaster' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. html_sidebars = { '**': [ 'about.html', 'navigation.html', 'relations.html', 'searchbox.html', ] } # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_domain_indices = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. #html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = None # Output file base name for HTML help builder. htmlhelp_basename = 'Bleachdoc' # -- Options for LaTeX output -------------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). #'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, documentclass [howto/manual]). 
latex_documents = [ ('index', 'Bleach.tex', u'Bleach Documentation', u'Will Kahn-Greene', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # If true, show page references after internal links. #latex_show_pagerefs = False # If true, show URL addresses after external links. #latex_show_urls = False # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_domain_indices = True # -- Options for manual page output -------------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'bleach', u'Bleach Documentation', [u'Will Kahn-Greene'], 1) ] # If true, show URL addresses after external links. #man_show_urls = False # -- Options for Texinfo output ------------------------------------------------ # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'Bleach', u'Bleach Documentation', u'Will Kahn-Greene', 'Bleach', 'One line description of project.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] # If false, no module index is generated. #texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. #texinfo_show_urls = 'footnote' bleach-2.1.2/docs/dev.rst000066400000000000000000000035041321226254300151740ustar00rootroot00000000000000================== Bleach development ================== Install for development ======================= To install Bleach to make changes to it: 1. Clone the repo from GitHub:: $ git clone git://github.com/mozilla/bleach.git 2. Create a virtual environment using whatever method you want. 3. Install Bleach into the virtual environment such that you can see changes:: $ pip install -e . .. include:: ../CONTRIBUTING.rst .. include:: ../CODE_OF_CONDUCT.rst Docs ==== Docs are in ``docs/``. We use Sphinx. Docs are pushed to ReadTheDocs via a GitHub webhook. Testing ======= Run:: $ tox That'll run Bleach tests in all the supported Python environments. Note that you need the necessary Python binaries for them all to be tested. Tests are run in Travis CI via a GitHub webhook. Release process =============== 1. Checkout master tip. 2. Check to make sure ``setup.py`` and ``requirements.txt`` are correct and match requirements-wise. 3. Update version numbers in ``bleach/__init__.py``. 1. Set ``__version__`` to something like ``2.0``. 2. Set ``__releasedate__`` to something like ``20120731``. 4. Update ``CONTRIBUTORS``, ``CHANGES`` and ``MANIFEST.in``. 5. Verify correctness. 1. Run tests with tox:: $ tox 2. Build the docs:: $ cd docs $ make html 3. Run the doctests:: $ cd docs/ $ make doctests 4. Verify everything works 6. Commit the changes. 7. Push the changes to GitHub. This will cause Travis to run the tests. 8. After Travis is happy, create a signed tag for the release:: $ git tag -s v0.4 Copy the details from ``CHANGES`` into the tag comment. 9. Push the new tag:: $ git push --tags official master That will push the release to PyPI. 10. Blog posts, twitter, update topic in ``#bleach``, etc. 
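A minimal sketch of the two module-level strings that the release checklist above updates in ``bleach/__init__.py`` and that ``docs/conf.py`` reads back as ``version`` and ``release``. The values are the placeholder values from the checklist, not the real file contents::

    # Hypothetical excerpt of bleach/__init__.py -- only the fields the
    # release checklist touches; the real module defines much more.
    __version__ = '2.0'            # version string, parsed with pkg_resources.parse_version
    __releasedate__ = '20120731'   # release date as a YYYYMMDD string

    # docs/conf.py builds its Sphinx version strings from these:
    #   version = bleach.__version__
    #   release = bleach.__version__ + ' ' + bleach.__releasedate__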
bleach-2.1.2/docs/goals.rst000066400000000000000000000142131321226254300155220ustar00rootroot00000000000000=============== Goals of Bleach =============== This document lists the goals and non-goals of Bleach. My hope is that by focusing on these goals and explicitly listing the non-goals, the project will evolve in a stronger direction. .. contents:: Goals ===== Always take a allowed-list-based approach ----------------------------------------- Bleach should always take a allowed-list-based approach to markup filtering. Specifying disallowed lists is error-prone and not future proof. For example, you should have to opt-in to allowing the ``onclick`` attribute, not opt-out of all the other ``on*`` attributes. Future versions of HTML may add new event handlers, like ``ontouch``, that old disallow would not prevent. Main goal is to sanitize input of malicious content --------------------------------------------------- The primary goal of Bleach is to sanitize user input that is allowed to contain *some* HTML as markup and is to be included in the content of a larger page in an HTML context. Examples of such content might include: * User comments on a blog. * "Bio" sections of a user profile. * Descriptions of a product or application. These examples, and others, are traditionally prone to security issues like XSS or other script injection, or annoying issues like unclosed tags and invalid markup. Bleach will take a proactive, allowed-list-only approach to allowing HTML content, and will use the HTML5 parsing algorithm to handle invalid markup. See the :ref:`chapter on clean() ` for more info. Safely create links ------------------- The secondary goal of Bleach is to provide a mechanism for finding or altering links (```` tags with ``href`` attributes, or things that look like URLs or email addresses) in text. While Bleach itself will always operate on a allowed-list-based security model, the :ref:`linkify() method ` is flexible enough to allow the creation, alteration, and removal of links based on an extremely wide range of use cases. Non-Goals ========= Bleach is designed to work with fragments of HTML by untrusted users. Some non-goal use cases include: Sanitize complete HTML documents -------------------------------- Bleach's ``clean`` is not for sanitizing entire HTML documents. Once you're creating whole documents, you have to allow so many tags that a disallow-list approach (e.g. 
forbidding ``& bleach-2.1.2/tests/data/1.test.out000066400000000000000000000000671321226254300166570ustar00rootroot00000000000000>"><script>alert("XSS")</script>&bleach-2.1.2/tests/data/10.test000066400000000000000000000000451321226254300161250ustar00rootroot00000000000000 bleach-2.1.2/tests/data/10.test.out000066400000000000000000000000521321226254300167310ustar00rootroot00000000000000<img src="javascript:alert('XSS');">bleach-2.1.2/tests/data/11.test000066400000000000000000000000421321226254300161230ustar00rootroot00000000000000 bleach-2.1.2/tests/data/11.test.out000066400000000000000000000000511321226254300167310ustar00rootroot00000000000000<img src="javascript:alert('XSS')">bleach-2.1.2/tests/data/12.test000066400000000000000000000000421321226254300161240ustar00rootroot00000000000000 bleach-2.1.2/tests/data/12.test.out000066400000000000000000000000511321226254300167320ustar00rootroot00000000000000<img src="JaVaScRiPt:alert('XSS')">bleach-2.1.2/tests/data/13.test000066400000000000000000000000611321226254300161260ustar00rootroot00000000000000")> bleach-2.1.2/tests/data/13.test.out000066400000000000000000000001031321226254300167310ustar00rootroot00000000000000<img src="JaVaScRiPt:alert(&quot;XSS<WBR">")> bleach-2.1.2/tests/data/14.test000066400000000000000000000001261321226254300161310ustar00rootroot00000000000000#115;crip&#116;:a bleach-2.1.2/tests/data/14.test.out000066400000000000000000000003131321226254300167350ustar00rootroot00000000000000<imgsrc=&#106;&#97;&#118;&#97;&<wbr>#115;crip&<wbr></wbr>#116;:a</imgsrc=&#106;&#97;&#118;&#97;&<wbr> bleach-2.1.2/tests/data/15.test000066400000000000000000000001061321226254300161300ustar00rootroot00000000000000le&#114;t('XS;S')> bleach-2.1.2/tests/data/15.test.out000066400000000000000000000001711321226254300167400ustar00rootroot00000000000000le&<wbr></wbr>#114;t('X&#83<wbr></wbr>;S'&#41> bleach-2.1.2/tests/data/16.test000066400000000000000000000003741321226254300161400ustar00rootroot00000000000000#0000118as&#0000099ri&#0000112t:&#0000097le&#0000114t(&#0000039XS&#0000083')> bleach-2.1.2/tests/data/16.test.out000066400000000000000000000010061321226254300167370ustar00rootroot00000000000000<imgsrc=&#0000106&#0000097&<wbr>#0000118&#0000097&#0000115&<wbr></wbr>#0000099&#0000114&#0000105&<wbr></wbr>#0000112&#0000116&#0000058&<wbr></wbr>#0000097&#0000108&#0000101&<wbr></wbr>#0000114&#0000116&#0000040&<wbr></wbr>#0000039&#0000088&#0000083&<wbr></wbr>#0000083&#0000039&#0000041></imgsrc=&#0000106&#0000097&<wbr> bleach-2.1.2/tests/data/17.test000066400000000000000000000002141321226254300161320ustar00rootroot00000000000000#x63ript:&#x61lert(&#x27XSS')> bleach-2.1.2/tests/data/17.test.out000066400000000000000000000005411321226254300167430ustar00rootroot00000000000000<imgsrc=&#x6a&#x61&#x76&#x61&#x73&<wbr>#x63&#x72&#x69&#x70&#x74&#x3A&<wbr></wbr>#x61&#x6C&#x65&#x72&#x74&#x28&<wbr></wbr>#x27&#x58&#x53&#x53&#x27&#x29></imgsrc=&#x6a&#x61&#x76&#x61&#x73&<wbr> bleach-2.1.2/tests/data/18.test000066400000000000000000000000601321226254300161320ustar00rootroot00000000000000 bleach-2.1.2/tests/data/18.test.out000066400000000000000000000000771321226254300167500ustar00rootroot00000000000000<img src="jav&#x09;ascript:alert(<WBR>'XSS');">bleach-2.1.2/tests/data/19.test000066400000000000000000000000601321226254300161330ustar00rootroot00000000000000 bleach-2.1.2/tests/data/19.test.out000066400000000000000000000000771321226254300167510ustar00rootroot00000000000000<img 
src="jav&#x0A;ascript:alert(<WBR>'XSS');">bleach-2.1.2/tests/data/2.test000066400000000000000000000000631321226254300160460ustar00rootroot00000000000000"> bleach-2.1.2/tests/data/2.test.out000066400000000000000000000001011321226254300166450ustar00rootroot00000000000000"><style>@import"javascript:alert('XSS')";</style>bleach-2.1.2/tests/data/20.test000066400000000000000000000000601321226254300161230ustar00rootroot00000000000000 bleach-2.1.2/tests/data/20.test.out000066400000000000000000000001001321226254300167240ustar00rootroot00000000000000<img src="jav&#x0D;ascript:alert(<WBR>'XSS');"> bleach-2.1.2/tests/data/3.test000066400000000000000000000003071321226254300160500ustar00rootroot00000000000000>"'> bleach-2.1.2/tests/data/3.test.out000066400000000000000000000006331321226254300166600ustar00rootroot00000000000000>"'><img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)></img%20src%3d%26%23x6a;%26%23x61;%26%23x76;%26%23x61;%26%23x73;%26%23x63;%26%23x72;%26%23x69;%26%23x70;%26%23x74;%26%23x3a;alert(%26quot;%26%23x20;xss%26%23x20;test%26%23x20;successful%26quot;)>bleach-2.1.2/tests/data/4.test000066400000000000000000000001431321226254300160470ustar00rootroot00000000000000ipt type="text/javascript">alert("foo");script> bleach-2.1.2/tests/data/4.test.out000066400000000000000000000001701321226254300166550ustar00rootroot00000000000000<scr<script>ipt type="text/javascript">alert("foo");script<del></del>></scr<script> bleach-2.1.2/tests/data/5.test000066400000000000000000000000731321226254300160520ustar00rootroot00000000000000>%22%27> bleach-2.1.2/tests/data/5.test.out000066400000000000000000000001771321226254300166650ustar00rootroot00000000000000>%22%27><img%20src%3d%22javascript:alert(%27%20xss%27)%22></img%20src%3d%22javascript:alert(%27%20xss%27)%22>bleach-2.1.2/tests/data/7.test000066400000000000000000000000031321226254300160450ustar00rootroot00000000000000"> bleach-2.1.2/tests/data/7.test.out000066400000000000000000000000051321226254300166550ustar00rootroot00000000000000">bleach-2.1.2/tests/data/8.test000066400000000000000000000000031321226254300160460ustar00rootroot00000000000000>" bleach-2.1.2/tests/data/8.test.out000066400000000000000000000000051321226254300166560ustar00rootroot00000000000000>"bleach-2.1.2/tests/data/9.test000066400000000000000000000000231321226254300160510ustar00rootroot00000000000000'';!--"=&{()} bleach-2.1.2/tests/data/9.test.out000066400000000000000000000000501321226254300166570ustar00rootroot00000000000000'';!--"<xss>=&{()}</xss>bleach-2.1.2/tests/test_callbacks.py000066400000000000000000000035111321226254300174240ustar00rootroot00000000000000from bleach.callbacks import nofollow, target_blank class TestNofollowCallback: def test_blank(self): attrs = {} assert nofollow(attrs) == attrs def test_no_href(self): attrs = {'_text': 'something something'} assert nofollow(attrs) == attrs def test_basic(self): attrs = {(None, 'href'): 'http://example.com'} assert ( nofollow(attrs) == {(None, 'href'): 'http://example.com', (None, 'rel'): 'nofollow'} ) def test_mailto(self): attrs = {(None, 'href'): 'mailto:joe@example.com'} assert nofollow(attrs) == attrs def test_has_nofollow_already(self): attrs = { (None, 'href'): 'http://example.com', (None, 'rel'): 'nofollow', } assert nofollow(attrs) == attrs def test_other_rel(self): attrs = { (None, 'href'): 'http://example.com', (None, 'rel'): 'next', } assert ( nofollow(attrs) == {(None, 'href'): 
'http://example.com', (None, 'rel'): 'next nofollow'} ) class TestTargetBlankCallback: def test_empty(self): attrs = {} assert target_blank(attrs) == attrs def test_mailto(self): attrs = {(None, u'href'): u'mailto:joe@example.com'} assert target_blank(attrs) == attrs def test_add_target(self): attrs = {(None, u'href'): u'http://example.com'} assert ( target_blank(attrs) == {(None, u'href'): u'http://example.com', (None, u'target'): u'_blank'} ) def test_stomp_target(self): attrs = {(None, u'href'): u'http://example.com', (None, u'target'): u'foo'} assert ( target_blank(attrs) == {(None, u'href'): 'http://example.com', (None, u'target'): u'_blank'} ) bleach-2.1.2/tests/test_clean.py000066400000000000000000000302321321226254300165670ustar00rootroot00000000000000from html5lib.filters.base import Filter import pytest import bleach from bleach.sanitizer import Cleaner def test_empty(): assert bleach.clean('') == '' def test_nbsp(): assert bleach.clean(' test string ') == ' test string ' def test_comments_only(): comment = '' assert bleach.clean(comment) == '' assert bleach.clean(comment, strip_comments=False) == comment open_comment = ''.format(open_comment) ) def test_with_comments(): text = 'Just text' assert bleach.clean(text) == 'Just text' assert bleach.clean(text, strip_comments=False) == text def test_no_html(): assert bleach.clean('no html string') == 'no html string' def test_allowed_html(): assert ( bleach.clean('an allowed tag') == 'an allowed tag' ) assert ( bleach.clean('another good tag') == 'another good tag' ) def test_bad_html(): assert ( bleach.clean('a fixed tag') == 'a fixed tag' ) def test_function_arguments(): TAGS = ['span', 'br'] ATTRS = {'span': ['style']} text = 'a
test' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == 'a
test' ) def test_named_arguments(): ATTRS = {'a': ['rel', 'href']} text = '
xx.com' assert bleach.clean(text) == 'xx.com' assert ( bleach.clean(text, attributes=ATTRS) == 'xx.com' ) def test_disallowed_html(): assert ( bleach.clean('a test') == 'a <script>safe()</script> test' ) assert ( bleach.clean('a test') == 'a <style>body{}</style> test' ) def test_bad_href(): assert ( bleach.clean('no link') == 'no link' ) @pytest.mark.parametrize('text, expected', [ ('an & entity', 'an & entity'), ('an < entity', 'an < entity'), ('tag < and entity', 'tag < and entity'), ]) def test_bare_entities(text, expected): assert bleach.clean(text) == expected @pytest.mark.parametrize('text, expected', [ # Test character entities ('&', '&'), (' ', ' '), ('<em>strong</em>', '<em>strong</em>'), # Test character entity at beginning of string ('&is cool', '&is cool'), # Test it at the end of the string ('cool &', 'cool &'), # Test bare ampersands and entities at beginning ('&& is cool', '&& is cool'), # Test entities and bare ampersand at end ('& is cool &&', '& is cool &&'), # Test missing semi-colon means we don't treat it like an entity ('this & that', 'this &amp that'), # Test a thing that looks like a character entity, but isn't because it's # missing a ; (¤) ( 'http://example.com?active=true¤t=true', 'http://example.com?active=true&current=true' ), # Test entities in HTML attributes ( 'foo', 'foo' ), ( 'foo', 'foo' ), ( 'foo', 'foo' ), # Test numeric entities (''', '''), ('"', '"'), ('{', '{'), ('{', '{'), ('{', '{'), # Test non-numeric entities ('&#', '&#'), ('&#<', '&#<') ]) def test_character_entities(text, expected): assert bleach.clean(text) == expected def test_weird_strings(): s = 'with html tags' assert ( bleach.clean(text, strip=True) == 'a test with html tags' ) text = 'a test with html tags' assert ( bleach.clean(text, strip=True) == 'a test with html tags' ) text = '

link text

' assert ( bleach.clean(text, tags=['p'], strip=True) == '

link text

' ) text = '

multiply nested text

' assert ( bleach.clean(text, tags=['p'], strip=True) == '

multiply nested text

' ) text = '

' assert ( bleach.clean(text, tags=['p', 'a'], strip=True) == '

' ) def test_allowed_styles(): ATTRS = ['style'] STYLE = ['color'] assert ( bleach.clean('', attributes=ATTRS) == '' ) text = '' assert bleach.clean(text, attributes=ATTRS, styles=STYLE) == text text = '' assert ( bleach.clean(text, attributes=ATTRS, styles=STYLE) == '' ) def test_lowercase_html(): """We should output lowercase HTML.""" assert ( bleach.clean('BAR', attributes=['class']) == 'BAR' ) def test_attributes_callable(): """Verify attributes can take a callable""" ATTRS = lambda tag, name, val: name == 'title' TAGS = ['a'] text = u'example' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == u'example' ) def test_attributes_wildcard(): """Verify attributes[*] works""" ATTRS = { '*': ['id'], 'img': ['src'], } TAGS = ['img', 'em'] text = 'both can have ' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == 'both can have ' ) def test_attributes_wildcard_callable(): """Verify attributes[*] callable works""" ATTRS = { '*': lambda tag, name, val: name == 'title' } TAGS = ['a'] assert ( bleach.clean(u'example', tags=TAGS, attributes=ATTRS) == u'example' ) def test_attributes_tag_callable(): """Verify attributes[tag] callable works""" def img_test(tag, name, val): return name == 'src' and val.startswith('https') ATTRS = { 'img': img_test, } TAGS = ['img'] text = 'foo blah baz' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == u'foo baz' ) text = 'foo blah baz' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == u'foo baz' ) def test_attributes_tag_list(): """Verify attributes[tag] list works""" ATTRS = { 'a': ['title'] } TAGS = ['a'] assert ( bleach.clean(u'example', tags=TAGS, attributes=ATTRS) == u'example' ) def test_attributes_list(): """Verify attributes list works""" ATTRS = ['title'] TAGS = ['a'] text = u'example' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == u'example' ) def test_svg_attr_val_allows_ref(): """Unescape values in svg attrs that allow url references""" # Local IRI, so keep it TAGS = ['svg', 'rect'] ATTRS = { 'rect': ['fill'], } text = '' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == '' ) # Non-local IRI, so drop it TAGS = ['svg', 'rect'] ATTRS = { 'rect': ['fill'], } text = '' assert ( bleach.clean(text, tags=TAGS, attributes=ATTRS) == '' ) @pytest.mark.parametrize('text, expected', [ ( '', '' ), ( '', # NOTE(willkg): Bug in html5lib serializer drops the xlink part '' ), ]) def test_svg_allow_local_href(text, expected): """Keep local hrefs for svg elements""" TAGS = ['svg', 'pattern'] ATTRS = { 'pattern': ['id', 'href'], } assert bleach.clean(text, tags=TAGS, attributes=ATTRS) == expected @pytest.mark.parametrize('text, expected', [ ( '', '' ), ( '', '' ), ]) def test_svg_allow_local_href_nonlocal(text, expected): """Drop non-local hrefs for svg elements""" TAGS = ['svg', 'pattern'] ATTRS = { 'pattern': ['id', 'href'], } assert bleach.clean(text, tags=TAGS, attributes=ATTRS) == expected @pytest.mark.xfail(reason='html5lib >= 0.99999999: changed API') def test_sarcasm(): """Jokes should crash.""" dirty = 'Yeah right ' clean = 'Yeah right <sarcasm/>' assert bleach.clean(dirty) == clean def test_user_defined_protocols_valid(): valid_href = 'allowed href' assert bleach.clean(valid_href, protocols=['myprotocol']) == valid_href def test_user_defined_protocols_invalid(): invalid_href = 'invalid href' cleaned_href = 'invalid href' assert bleach.clean(invalid_href, protocols=['my_protocol']) == cleaned_href def test_filters(): # Create a Filter that changes all the attr values to "moo" class MooFilter(Filter): def 
__iter__(self): for token in Filter.__iter__(self): if token['type'] in ['StartTag', 'EmptyTag'] and token['data']: for attr, value in token['data'].items(): token['data'][attr] = 'moo' yield token ATTRS = { 'img': ['rel', 'src'] } TAGS = ['img'] cleaner = Cleaner(tags=TAGS, attributes=ATTRS, filters=[MooFilter]) dirty = 'this is cute! ' assert ( cleaner.clean(dirty) == 'this is cute! ' ) def test_clean_idempotent(): """Make sure that applying the filter twice doesn't change anything.""" dirty = 'invalid & < extra http://link.com' assert bleach.clean(bleach.clean(dirty)) == bleach.clean(dirty) def test_only_text_is_cleaned(): some_text = 'text' some_type = int no_type = None assert bleach.clean(some_text) == some_text with pytest.raises(TypeError) as e: bleach.clean(some_type) assert "argument cannot be of 'type' type" in str(e) with pytest.raises(TypeError) as e: bleach.clean(no_type) assert "NoneType" in str(e) class TestCleaner: def test_basics(self): TAGS = ['span', 'br'] ATTRS = {'span': ['style']} cleaner = Cleaner(tags=TAGS, attributes=ATTRS) assert ( cleaner.clean('a
test') == 'a
test' ) bleach-2.1.2/tests/test_css.py000066400000000000000000000104501321226254300162750ustar00rootroot00000000000000from functools import partial import pytest from bleach import clean clean = partial(clean, tags=['p'], attributes=['style']) @pytest.mark.parametrize('data, styles, expected', [ ( 'font-family: Arial; color: red; float: left; background-color: red;', ['color'], 'color: red;' ), ( 'border: 1px solid blue; color: red; float: left;', ['color'], 'color: red;' ), ( 'border: 1px solid blue; color: red; float: left;', ['color', 'float'], 'color: red; float: left;' ), ( 'color: red; float: left; padding: 1em;', ['color', 'float'], 'color: red; float: left;' ), ( 'color: red; float: left; padding: 1em;', ['color'], 'color: red;' ), ( 'cursor: -moz-grab;', ['cursor'], 'cursor: -moz-grab;' ), ( 'color: hsl(30,100%,50%);', ['color'], 'color: hsl(30,100%,50%);' ), ( 'color: rgba(255,0,0,0.4);', ['color'], 'color: rgba(255,0,0,0.4);' ), ( "text-overflow: ',' ellipsis;", ['text-overflow'], "text-overflow: ',' ellipsis;" ), ( 'text-overflow: "," ellipsis;', ['text-overflow'], 'text-overflow: "," ellipsis;' ), ( 'font-family: "Arial";', ['font-family'], 'font-family: "Arial";' ), ]) def test_allowed_css(data, styles, expected): p_single = '

bar

' p_double = "

bar

" if '"' in data: assert clean(p_double.format(data), styles=styles) == p_double.format(expected) else: assert clean(p_single.format(data), styles=styles) == p_single.format(expected) def test_valid_css(): """The sanitizer should fix missing CSS values.""" styles = ['color', 'float'] assert ( clean('

foo

', styles=styles) == '

foo

' ) assert ( clean('

foo

', styles=styles) == '

foo

' ) def test_style_hang(): """The sanitizer should not hang on any inline styles""" style = [ 'margin-top: 0px;', 'margin-right: 0px;', 'margin-bottom: 1.286em;', 'margin-left: 0px;', 'padding-top: 15px;', 'padding-right: 15px;', 'padding-bottom: 15px;', 'padding-left: 15px;', 'border-top-width: 1px;', 'border-right-width: 1px;', 'border-bottom-width: 1px;', 'border-left-width: 1px;', 'border-top-style: dotted;', 'border-right-style: dotted;', 'border-bottom-style: dotted;', 'border-left-style: dotted;', 'border-top-color: rgb(203, 200, 185);', 'border-right-color: rgb(203, 200, 185);', 'border-bottom-color: rgb(203, 200, 185);', 'border-left-color: rgb(203, 200, 185);', 'background-image: initial;', 'background-attachment: initial;', 'background-origin: initial;', 'background-clip: initial;', 'background-color: rgb(246, 246, 242);', 'overflow-x: auto;', 'overflow-y: auto;', 'font: italic small-caps bolder condensed 16px/3 cursive;', 'background-position: initial initial;', 'background-repeat: initial initial;' ] html = '

Hello world

' % ' '.join(style) styles = [ 'border', 'float', 'overflow', 'min-height', 'vertical-align', 'white-space', 'margin', 'margin-left', 'margin-top', 'margin-bottom', 'margin-right', 'padding', 'padding-left', 'padding-top', 'padding-bottom', 'padding-right', 'background', 'background-color', 'font', 'font-size', 'font-weight', 'text-align', 'text-transform', ] expected = ( '

Hello world

' ) assert clean(html, styles=styles) == expected bleach-2.1.2/tests/test_linkify.py000066400000000000000000000464361321226254300171670ustar00rootroot00000000000000import re try: from urllib.parse import quote_plus except ImportError: from urllib import quote_plus import pytest from bleach import linkify, DEFAULT_CALLBACKS as DC from bleach.linkifier import Linker def test_empty(): assert linkify('') == '' def test_simple_link(): assert ( linkify('a http://example.com link') == 'a http://example.com link' ) assert ( linkify('a https://example.com link') == 'a https://example.com link' ) assert ( linkify('a example.com link') == 'a example.com link' ) def test_trailing_slash(): assert ( linkify('http://examp.com/') == 'http://examp.com/' ) assert ( linkify('http://example.com/foo/') == 'http://example.com/foo/' ) assert ( linkify('http://example.com/foo/bar/') == 'http://example.com/foo/bar/' ) def test_mangle_link(): """We can muck with the href attribute of the link.""" def filter_url(attrs, new=False): if not attrs.get((None, 'href'), '').startswith('http://bouncer'): quoted = quote_plus(attrs[(None, 'href')]) attrs[(None, 'href')] = 'http://bouncer/?u={0!s}'.format(quoted) return attrs assert ( linkify('http://example.com', callbacks=DC + [filter_url]) == 'http://example.com' ) def test_mangle_text(): """We can muck with the inner text of a link.""" def ft(attrs, new=False): attrs['_text'] = 'bar' return attrs assert ( linkify('http://ex.mp foo', callbacks=[ft]) == 'bar bar' ) @pytest.mark.parametrize('data,parse_email,expected', [ ( 'a james@example.com mailto', False, 'a james@example.com mailto' ), ( 'a james@example.com.au mailto', False, 'a james@example.com.au mailto' ), ( 'a james@example.com mailto', True, 'a james@example.com mailto' ), ( 'aussie james@example.com.au mailto', True, 'aussie james@example.com.au mailto' ), # This is kind of a pathological case. I guess we do our best here. ( 'email to james@example.com', True, 'email to james@example.com' ), ( '
jinkyun@example.com', True, '
jinkyun@example.com' ), # Mailto links at the end of a sentence. ( 'mailto james@example.com.au.', True, 'mailto james@example.com.au.' ), # Incorrect email ( '"\\\n"@opa.ru', True, '"\\\n"@opa.ru' ), ]) def test_email_link(data, parse_email, expected): assert linkify(data, parse_email=parse_email) == expected @pytest.mark.parametrize('data,expected', [ ( '"james"@example.com', '''"james"@example.com''' ), ( '"j\'ames"@example.com', '''"j'ames"@example.com''' ), ( '"ja>mes"@example.com', '''"ja>mes"@example.com''' ), ]) def test_email_link_escaping(data, expected): assert linkify(data, parse_email=True) == expected def no_new_links(attrs, new=False): if new: return None return attrs def no_old_links(attrs, new=False): if not new: return None return attrs def noop(attrs, new=False): return attrs @pytest.mark.parametrize('callback,expected', [ ( [noop], 'a ex.mp example' ), ( [no_new_links, noop], 'a ex.mp example' ), ( [noop, no_new_links], 'a ex.mp example' ), ( [no_old_links, noop], 'a ex.mp example' ), ( [noop, no_old_links], 'a ex.mp example' ), ( [no_old_links, no_new_links], 'a ex.mp example' ) ]) def test_prevent_links(callback, expected): """Returning None from any callback should remove links or prevent them from being created.""" text = 'a ex.mp example' assert linkify(text, callbacks=callback) == expected def test_set_attrs(): """We can set random attributes on links.""" def set_attr(attrs, new=False): attrs[(None, u'rev')] = u'canonical' return attrs assert ( linkify('ex.mp', callbacks=[set_attr]) == 'ex.mp' ) def test_only_proto_links(): """Only create links if there's a protocol.""" def only_proto(attrs, new=False): if new and not attrs['_text'].startswith(('http:', 'https:')): return None return attrs in_text = 'a ex.mp http://ex.mp bar' assert ( linkify(in_text, callbacks=[only_proto]) == 'a ex.mp http://ex.mp bar' ) def test_stop_email(): """Returning None should prevent a link from being created.""" def no_email(attrs, new=False): if attrs[(None, 'href')].startswith('mailto:'): return None return attrs text = 'do not link james@example.com' assert linkify(text, parse_email=True, callbacks=[no_email]) == text @pytest.mark.parametrize('data,expected', [ # tlds ('example.com', 'example.com'), ('example.co', 'example.co'), ('example.co.uk', 'example.co.uk'), ('example.edu', 'example.edu'), ('example.xxx', 'example.xxx'), ('bit.ly/fun', 'bit.ly/fun'), # non-tlds ('example.yyy', 'example.yyy'), ('brie', 'brie'), ]) def test_tlds(data, expected): assert linkify(data) == expected def test_escaping(): assert linkify('< unrelated') == '< unrelated' def test_nofollow_off(): assert linkify('example.com', callbacks=[]) == 'example.com' def test_link_in_html(): assert ( linkify('http://yy.com') == 'http://yy.com' ) assert ( linkify('http://xx.com') == 'http://xx.com' ) def test_links_https(): assert ( linkify('https://yy.com') == 'https://yy.com' ) def test_add_rel_nofollow(): """Verify that rel="nofollow" is added to an existing link""" assert ( linkify('http://yy.com') == 'http://yy.com' ) def test_url_with_path(): assert ( linkify('http://example.com/path/to/file') == '' 'http://example.com/path/to/file' ) def test_link_ftp(): assert ( linkify('ftp://ftp.mozilla.org/some/file') == '' 'ftp://ftp.mozilla.org/some/file' ) def test_link_query(): assert ( linkify('http://xx.com/?test=win') == 'http://xx.com/?test=win' ) assert ( linkify('xx.com/?test=win') == 'xx.com/?test=win' ) assert ( linkify('xx.com?test=win') == 'xx.com?test=win' ) def test_link_fragment(): assert ( 
linkify('http://xx.com/path#frag') == 'http://xx.com/path#frag' ) def test_link_entities(): assert ( linkify('http://xx.com/?a=1&b=2') == 'http://xx.com/?a=1&b=2' ) def test_escaped_html(): """If I pass in escaped HTML, it should probably come out escaped.""" s = '<em>strong</em>' assert linkify(s) == s def test_link_http_complete(): assert ( linkify('https://user:pass@ftp.mozilla.org/x/y.exe?a=b&c=d&e#f') == '' 'https://user:pass@ftp.mozilla.org/x/y.exe?a=b&c=d&e#f' ) def test_non_url(): """document.vulnerable should absolutely not be linkified.""" s = 'document.vulnerable' assert linkify(s) == s def test_javascript_url(): """javascript: urls should never be linkified.""" s = 'javascript:document.vulnerable' assert linkify(s) == s def test_unsafe_url(): """Any unsafe char ({}[]<>, etc.) in the path should end URL scanning.""" assert ( linkify('All your{"xx.yy.com/grover.png"}base are') == 'All your{"xx.yy.com/grover.png"}' 'base are' ) def test_skip_tags(): """Skip linkification in skip tags""" simple = 'http://xx.com
http://xx.com
' linked = ('http://xx.com ' '
http://xx.com
') all_linked = ('http://xx.com ' '
http://xx.com'
                  '
') assert linkify(simple, skip_tags=['pre']) == linked assert linkify(simple) == all_linked already_linked = '
xx
' nofollowed = '
xx
' assert linkify(already_linked) == nofollowed assert linkify(already_linked, skip_tags=['pre']) == nofollowed assert ( linkify('
http://example.com
http://example.com', skip_tags=['pre']) == ( '
http://example.com
' 'http://example.com' ) ) def test_libgl(): """libgl.so.1 should not be linkified.""" s = 'libgl.so.1' assert linkify(s) == s @pytest.mark.parametrize('url,periods', [ ('example.com', '.'), ('example.com', '...'), ('ex.com/foo', '.'), ('ex.com/foo', '....'), ]) def test_end_of_sentence(url, periods): """example.com. should match.""" out = '{0!s}{1!s}' intxt = '{0!s}{1!s}' assert linkify(intxt.format(url, periods)) == out.format(url, periods) def test_end_of_clause(): """example.com/foo, shouldn't include the ,""" assert ( linkify('ex.com/foo, bar') == 'ex.com/foo, bar' ) @pytest.mark.xfail(reason='html5lib >= 0.99999999: changed API') def test_sarcasm(): """Jokes should crash.""" assert linkify('Yeah right ') == 'Yeah right <sarcasm/>' @pytest.mark.parametrize('data,expected_data', [ ( '(example.com)', ('(', 'example.com', 'example.com', ')') ), ( '(example.com/)', ('(', 'example.com/', 'example.com/', ')') ), ( '(example.com/foo)', ('(', 'example.com/foo', 'example.com/foo', ')') ), ( '(((example.com/))))', ('(((', 'example.com/', 'example.com/', '))))') ), ( 'example.com/))', ('', 'example.com/', 'example.com/', '))') ), ( '(foo http://example.com/)', ('(foo ', 'example.com/', 'http://example.com/', ')') ), ( '(foo http://example.com)', ('(foo ', 'example.com', 'http://example.com', ')') ), ( 'http://en.wikipedia.org/wiki/Test_(assessment)', ('', 'en.wikipedia.org/wiki/Test_(assessment)', 'http://en.wikipedia.org/wiki/Test_(assessment)', '') ), ( '(http://en.wikipedia.org/wiki/Test_(assessment))', ('(', 'en.wikipedia.org/wiki/Test_(assessment)', 'http://en.wikipedia.org/wiki/Test_(assessment)', ')') ), ( '((http://en.wikipedia.org/wiki/Test_(assessment))', ('((', 'en.wikipedia.org/wiki/Test_(assessment', 'http://en.wikipedia.org/wiki/Test_(assessment', '))') ), ( '(http://en.wikipedia.org/wiki/Test_(assessment)))', ('(', 'en.wikipedia.org/wiki/Test_(assessment))', 'http://en.wikipedia.org/wiki/Test_(assessment))', ')') ), ( '(http://en.wikipedia.org/wiki/)Test_(assessment', ('(', 'en.wikipedia.org/wiki/)Test_(assessment', 'http://en.wikipedia.org/wiki/)Test_(assessment', '') ), ( 'hello (http://www.mu.de/blah.html) world', ('hello (', 'www.mu.de/blah.html', 'http://www.mu.de/blah.html', ') world') ), ( 'hello (http://www.mu.de/blah.html). world', ('hello (', 'www.mu.de/blah.html', 'http://www.mu.de/blah.html', '). 
world') ) ]) def test_wrapping_parentheses(data, expected_data): """URLs wrapped in parantheses should not include them.""" out = '{0!s}{2!s}{3!s}' assert linkify(data) == out.format(*expected_data) def test_parentheses_with_removing(): expected = '(test.py)' assert linkify(expected, callbacks=[lambda *a: None]) == expected @pytest.mark.parametrize('data,expected_data', [ # Test valid ports ('http://foo.com:8000', ('http://foo.com:8000', '')), ('http://foo.com:8000/', ('http://foo.com:8000/', '')), # Test non ports ('http://bar.com:xkcd', ('http://bar.com', ':xkcd')), ('http://foo.com:81/bar', ('http://foo.com:81/bar', '')), ('http://foo.com:', ('http://foo.com', ':')), # Test non-ascii ports ('http://foo.com:\u0663\u0669/', ('http://foo.com', ':\u0663\u0669/')), ('http://foo.com:\U0001d7e0\U0001d7d8/', ('http://foo.com', ':\U0001d7e0\U0001d7d8/')), ]) def test_ports(data, expected_data): """URLs can contain port numbers.""" out = '{0}{1}' assert linkify(data) == out.format(*expected_data) def test_ignore_bad_protocols(): assert ( linkify('foohttp://bar') == 'foohttp://bar' ) assert ( linkify('fohttp://exampl.com') == 'fohttp://exampl.com' ) def test_link_emails_and_urls(): """parse_email=True shouldn't prevent URLs from getting linkified.""" assert ( linkify('http://example.com person@example.com', parse_email=True) == ( '' 'http://example.com ' 'person@example.com' ) ) def test_links_case_insensitive(): """Protocols and domain names are case insensitive.""" expect = 'HTTP://EXAMPLE.COM' assert linkify('HTTP://EXAMPLE.COM') == expect def test_elements_inside_links(): assert ( linkify('hello
') == 'hello
' ) assert ( linkify('bold hello
') == 'bold hello
' ) def test_drop_link_tags(): """Verify that dropping link tags *just* drops the tag and not the content""" html = ( 'first second third ' 'fourth fifth' ) assert ( linkify(html, callbacks=[lambda attrs, new: None]) == 'first second third fourth fifth' ) @pytest.mark.parametrize('text, expected', [ (u'<br>', u'<br>'), ( u'<br> http://example.com', u'<br> http://example.com' ), ( u'<br>
http://example.com', u'<br>
http://example.com' ) ]) def test_naughty_unescaping(text, expected): """Verify that linkify is not unescaping things it shouldn't be""" assert linkify(text) == expected def test_hang(): """This string would hang linkify. Issue #200""" assert ( linkify("an@email.com", parse_email=True) == 'an@email.com' ) def test_hyphen_in_mail(): """Test hyphens `-` in mails. Issue #300.""" assert ( linkify('ex@am-ple.com', parse_email=True) == 'ex@am-ple.com' ) def test_url_re_arg(): """Verifies that a specified url_re is used""" fred_re = re.compile(r"""(fred\.com)""") linker = Linker(url_re=fred_re) assert ( linker.linkify('a b c fred.com d e f') == 'a b c fred.com d e f' ) assert ( linker.linkify('a b c http://example.com d e f') == 'a b c http://example.com d e f' ) def test_email_re_arg(): """Verifies that a specified email_re is used""" fred_re = re.compile(r"""(fred@example\.com)""") linker = Linker(parse_email=True, email_re=fred_re) assert ( linker.linkify('a b c fred@example.com d e f') == 'a b c fred@example.com d e f' ) assert ( linker.linkify('a b c jim@example.com d e f') == 'a b c jim@example.com d e f' ) def test_linkify_idempotent(): dirty = 'invalid & < extra http://link.com' assert linkify(linkify(dirty)) == linkify(dirty) class TestLinkify: def test_no_href_links(self): s = 'x' assert linkify(s) == s def test_rel_already_there(self): """Make sure rel attribute is updated not replaced""" linked = ('Click ' 'here.') link_good = 'Click here.' assert linkify(linked) == link_good assert linkify(link_good) == link_good def test_only_text_is_linkified(self): some_text = 'text' some_type = int no_type = None assert linkify(some_text) == some_text with pytest.raises(TypeError): linkify(some_type) with pytest.raises(TypeError): linkify(no_type) bleach-2.1.2/tests/test_security.py000066400000000000000000000133271321226254300173620ustar00rootroot00000000000000"""More advanced security tests""" import os import pytest import six from bleach import clean def test_escaped_entities(): # html5lib unescapes character entities, so these would become ' and " # which makes it possible to break out of html attributes. # # Verify that bleach.clean() doesn't unescape entities. 
assert ( clean(''"') == ''"' ) def test_nested_script_tag(): assert ( clean('</script>') == '<<script>script>evil()<</script>/script>' ) assert ( clean('<script>evil()</script>') == '<<x>script>evil()<</x>/script>' ) def test_nested_script_tag_r(): assert ( clean('>evil()>') == '<script<script>>evil()></script<script>' ) def test_invalid_attr(): IMG = ['img', ] IMG_ATTR = ['src'] assert ( clean('test') == 'test' ) assert ( clean('', tags=IMG, attributes=IMG_ATTR) == '' ) assert ( clean('', tags=IMG, attributes=IMG_ATTR) == '' ) def test_unquoted_attr(): assert ( clean('myabbr') == 'myabbr' ) def test_unquoted_event_handler(): assert ( clean('xx.com') == 'xx.com' ) def test_invalid_attr_value(): assert ( clean('') == '<img src="javascript:alert(\'XSS\');">' ) def test_invalid_href_attr(): assert ( clean('xss') == 'xss' ) def test_invalid_filter_attr(): IMG = ['img', ] IMG_ATTR = { 'img': lambda tag, name, val: name == 'src' and val == "http://example.com/" } assert ( clean('', tags=IMG, attributes=IMG_ATTR) == '' ) assert ( clean('', tags=IMG, attributes=IMG_ATTR) == '' ) def test_invalid_tag_char(): assert ( clean('') in [ '<script src="http://xx.com/xss.js" xss=""></script>', '<script xss="" src="http://xx.com/xss.js"></script>' ] ) assert ( clean('') == '<script src="http://xx.com/xss.js"></script>' ) def test_unclosed_tag(): assert ( clean('ipt>' assert clean(s, strip=True) == 'pt>alert(1)ipt>' s = 'pt>pt>alert(1)' assert clean(s, strip=True) == 'pt>pt>alert(1)' def test_poster_attribute(): """Poster attributes should not allow javascript.""" tags = ['video'] attrs = {'video': ['poster']} test = '' assert clean(test, tags=tags, attributes=attrs) == '' ok = '' assert clean(ok, tags=tags, attributes=attrs) == ok def test_feed_protocol(): assert clean('foo') == 'foo' @pytest.mark.parametrize('data, expected', [ # Convert bell ('1\a23', '1?23'), # Convert backpsace ('1\b23', '1?23'), # Convert formfeed ('1\v23', '1?23'), # Convert vertical tab ('1\f23', '1?23'), # Convert a bunch of characters in a string ('import y\bose\bm\bi\bt\be\b', 'import y?ose?m?i?t?e?'), ]) def test_invisible_characters(data, expected): assert clean(data) == expected def get_tests(): """Retrieves regression tests from data/ directory :returns: list of ``(filename, filedata)`` tuples """ datadir = os.path.join(os.path.dirname(__file__), 'data') tests = [ os.path.join(datadir, fn) for fn in os.listdir(datadir) if fn.endswith('.test') ] # Sort numerically which makes it easier to iterate through them tests.sort(key=lambda x: int(os.path.basename(x).split('.', 1)[0])) testcases = [ (fn, open(fn, 'r').read()) for fn in tests ] return testcases @pytest.mark.parametrize('fn, text', get_tests()) def test_regressions(fn, text): """Regression tests for clean so we can see if there are issues""" expected = six.text_type(open(fn + '.out', 'r').read()) # NOTE(willkg): This strips input and expected which makes it easier to # maintain the files. If there comes a time when the input needs whitespace # at the beginning or end, then we'll have to figure out something else. 
assert clean(text.strip()) == expected.strip() bleach-2.1.2/tests/test_unicode.py000066400000000000000000000035031321226254300171340ustar00rootroot00000000000000# -*- coding: utf-8 -*- from __future__ import unicode_literals import pytest from bleach import clean, linkify def test_japanese_safe_simple(): assert clean('ヘルプとチュートリアル') == 'ヘルプとチュートリアル' assert linkify('ヘルプとチュートリアル') == 'ヘルプとチュートリアル' def test_japanese_strip(): assert clean('ヘルプとチュートリアル') == 'ヘルプとチュートリアル' assert clean('ヘルプとチュートリアル') == '<span>ヘルプとチュートリアル</span>' def test_russian_simple(): assert clean('Домашняя') == 'Домашняя' assert linkify('Домашняя') == 'Домашняя' def test_mixed(): assert clean('Домашняяヘルプとチュートリアル') == 'Домашняяヘルプとチュートリアル' def test_mixed_linkify(): assert ( linkify('Домашняя http://example.com ヘルプとチュートリアル') == 'Домашняя http://example.com ヘルプとチュートリアル' ) @pytest.mark.parametrize('test,expected', [ ('http://éxámplé.com/', 'http://éxámplé.com/'), ('http://éxámplé.com/íàñá/', 'http://éxámplé.com/íàñá/'), ('http://éxámplé.com/íàñá/?foo=bar', 'http://éxámplé.com/íàñá/?foo=bar'), ('http://éxámplé.com/íàñá/?fóo=bár', 'http://éxámplé.com/íàñá/?fóo=bár'), ]) def test_url_utf8(test, expected): """Allow UTF8 characters in URLs themselves.""" outs = ('{0!s}', '{0!s}') out = lambda url: [x.format(url) for x in outs] expected = out(expected) assert linkify(test) in expected bleach-2.1.2/tests/test_utils.py000066400000000000000000000021351321226254300166460ustar00rootroot00000000000000from collections import OrderedDict from bleach.utils import alphabetize_attributes class TestAlphabeticalAttributes: def test_empty_cases(self): assert alphabetize_attributes(None) is None assert alphabetize_attributes({}) == {} def test_ordering(self): assert ( alphabetize_attributes({ (None, 'a'): 1, (None, 'b'): 2 }) == OrderedDict([ ((None, 'a'), 1), ((None, 'b'), 2) ]) ) assert ( alphabetize_attributes({ (None, 'b'): 1, (None, 'a'): 2} ) == OrderedDict([ ((None, 'a'), 2), ((None, 'b'), 1) ]) ) def test_different_namespaces(self): assert ( alphabetize_attributes({ ('xlink', 'href'): 'abc', (None, 'alt'): '123' }) == OrderedDict([ ((None, 'alt'), '123'), (('xlink', 'href'), 'abc') ]) ) bleach-2.1.2/tests_website/000077500000000000000000000000001321226254300156165ustar00rootroot00000000000000bleach-2.1.2/tests_website/.gitignore000066400000000000000000000000171321226254300176040ustar00rootroot00000000000000testcases.json bleach-2.1.2/tests_website/README.rst000066400000000000000000000011611321226254300173040ustar00rootroot00000000000000============ Test website ============ This holds infrastructure for running Bleach regression tests in a browser. Usage ===== From the repository root: 1. Generate the test cases:: python tests_website/data_to_json.py tests/data > tests_website/testcases.json 2. Run the test server as a background process:: cd tests_website && python server.py & You could also run it in a separate terminal by omitting the ``&`` at the end. 3. Open the page in browsers Python can find on your machine:: python tests_website/open_test_page.py 4. Go through the web pages and inspect the bleached HTML. bleach-2.1.2/tests_website/data_to_json.py000077500000000000000000000026571321226254300206510ustar00rootroot00000000000000#!/usr/bin/env python """ Util to write a directory of test cases with input filenames .test and output filenames .test.out as JSON to stdout. 
example: python tests/data_to_json.py tests/data > testcases.json
"""

import argparse
import fnmatch
import json
import os
import os.path

import bleach


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('data_dir',
                        help='directory containing test cases with input files'
                             ' named .test and output .test.out')
    args = parser.parse_args()

    filenames = os.listdir(args.data_dir)
    ins = [os.path.join(args.data_dir, f) for f in filenames
           if fnmatch.fnmatch(f, '*.test')]
    outs = [os.path.join(args.data_dir, f) for f in filenames
            if fnmatch.fnmatch(f, '*.test.out')]

    testcases = []
    for infn, outfn in zip(ins, outs):
        case_name = infn.rsplit('.test', 1)[0]

        with open(infn, 'r') as fin, open(outfn, 'r') as fout:
            payload = fin.read()[:-1]
            testcases.append({
                "title": case_name,
                "input_filename": infn,
                "output_filename": outfn,
                "payload": payload,
                "actual": bleach.clean(payload),
                "expected": fout.read(),
            })

    print(json.dumps(testcases, indent=4, sort_keys=True))


if __name__ == '__main__':
    main()

bleach-2.1.2/tests_website/index.html000066400000000000000000000153311321226254300176160ustar00rootroot00000000000000
Python Bleach 2.0.0

[index.html is the demo page; its markup was lost in extraction, so only the page text survives below.]

Python Bleach 2.0.0 (page heading, followed by the pypi version and Build Status badges)

Demo

This is the demo for Bleach, a whitelist-based HTML sanitizing library that escapes or strips markup and attributes. Enter a sample payload in the textarea below and watch it sanitize in the textarea and iframe below.

clean when dirty HTML changes (control label)
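The demo page drives the same bleach.clean() call that the tests above exercise. Below is a minimal sketch of the escape-versus-strip behaviour the description mentions; it assumes only that bleach is importable, and the outputs shown in comments are what the default settings are expected to produce, not strings taken from this repository.

# sketch: escaping vs. stripping with bleach.clean (illustrative, not from the test data)
import bleach

dirty = '<script>evil()</script><b>bold</b>'

# Default behaviour: disallowed tags are escaped, allowed tags such as <b> pass through.
print(bleach.clean(dirty))
# expected: &lt;script&gt;evil()&lt;/script&gt;<b>bold</b>

# strip=True: disallowed tags are removed instead of escaped; their text content stays.
print(bleach.clean(dirty, strip=True))
# expected: evil()<b>bold</b>

# Whitelisting a tag/attribute pair, as test_poster_attribute does above:
print(bleach.clean('<video poster="/cats.png"></video>',
                   tags=['video'], attributes={'video': ['poster']}))
# expected: <video poster="/cats.png"></video>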

bleach-2.1.2/tests_website/open_test_page.py000077500000000000000000000012221321226254300211640ustar00rootroot00000000000000#!/usr/bin/env python

import webbrowser


TEST_BROWSERS = set([
    # 'mozilla',
    'firefox',
    # 'netscape',
    # 'galeon',
    # 'epiphany',
    # 'skipstone',
    # 'kfmclient',
    # 'konqueror',
    # 'kfm',
    # 'mosaic',
    # 'opera',
    # 'grail',
    # 'links',
    # 'elinks',
    # 'lynx',
    # 'w3m',
    'windows-default',
    # 'macosx',
    'safari',
    # 'google-chrome',
    'chrome',
    # 'chromium',
    # 'chromium-browser',
])

REGISTERED_BROWSERS = set(webbrowser._browsers.keys())


if __name__ == '__main__':
    for b in TEST_BROWSERS & REGISTERED_BROWSERS:
        webbrowser.get(b).open_new_tab('http://localhost:8080')

bleach-2.1.2/tests_website/server.py000077500000000000000000000025001321226254300174760ustar00rootroot00000000000000#!/usr/bin/env python

"""
Simple Test/Demo Server for running bleach.clean output on various desktops.

Usage: python server.py
"""

import six

import bleach


PORT = 8080


class BleachCleanHandler(six.moves.SimpleHTTPServer.SimpleHTTPRequestHandler):
    def do_POST(self):
        if six.PY2:
            content_len = int(self.headers.getheader('content-length', 0))
        else:
            content_len = int(self.headers.get('content-length', 0))

        body = self.rfile.read(content_len)
        print("read %s bytes: %s" % (content_len, body))

        if six.PY3:
            body = body.decode('utf-8')

        print('input: %r' % body)
        cleaned = bleach.clean(body)

        self.send_response(200)
        self.send_header('Content-Length', len(cleaned))
        self.send_header('Content-Type', 'text/plain;charset=UTF-8')
        self.end_headers()

        if six.PY3:
            cleaned = bytes(cleaned, encoding='utf-8')

        print("cleaned: %r" % cleaned)
        self.wfile.write(cleaned)


if __name__ == '__main__':
    # Prevent 'cannot bind to address' errors on restart
    six.moves.socketserver.TCPServer.allow_reuse_address = True
    httpd = six.moves.socketserver.TCPServer(('127.0.0.1', PORT), BleachCleanHandler)

    print("listening on localhost port %d" % PORT)
    httpd.serve_forever()

bleach-2.1.2/tox.ini000066400000000000000000000031221321226254300142430ustar00rootroot00000000000000# Tox (http://tox.testrun.org/) is a tool for running tests
# in multiple virtualenvs. This configuration file will run the
# test suite on all supported python versions. To use it, "pip install tox"
# and then run "tox" from this directory.

[tox]
envlist =
    py{27,33,34,35,36}-html5lib{99999999,999999999,10b9,10b10,101}
    pypy-html5lib99999999
    py{27,33,34,35,36}-build-no-lang
    docs
    lint

[testenv]
basepython =
    py27: python2.7
    py33: python3.3
    py34: python3.4
    py35: python3.5
    py36: python3.6
deps =
    -rrequirements.txt
    html5lib99999999: html5lib==0.99999999
    html5lib999999999: html5lib==0.999999999
    html5lib10b9: html5lib==1.0b9
    html5lib10b10: html5lib==1.0b10
    html5lib101: html5lib==1.0.1
commands =
    py.test {posargs:-v}
    python setup.py build

[testenv:py27-build-no-lang]
basepython = python2.7
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py33-build-no-lang]
basepython = python3.3
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py34-build-no-lang]
basepython = python3.4
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py35-build-no-lang]
basepython = python3.5
setenv =
    LANG=
commands =
    python setup.py build

[testenv:py36-build-no-lang]
basepython = python3.6
setenv =
    LANG=
commands =
    python setup.py build

[testenv:lint]
basepython = python
deps =
    -rrequirements.txt
commands =
    flake8 bleach/

[testenv:docs]
basepython = python
changedir = docs
deps =
    -rrequirements.txt
commands =
    sphinx-build -b html -d {envtmpdir}/doctrees . {envtmpdir}/html
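For step 1 of the tests_website README above, the regression corpus that data_to_json.py and the test_regressions loader read is simply pairs of files under tests/data: an NN.test file holding a raw payload and a matching NN.test.out file holding the expected clean() output. A hypothetical sketch of creating one such pair follows; the name 500.test is made up, and real entries keep the numeric prefix that get_tests() sorts on.

# sketch: one hypothetical tests/data regression pair (file names are made up)
import os

import bleach

os.makedirs('tests/data', exist_ok=True)

payload = '<span>text</span>'

with open('tests/data/500.test', 'w') as f:
    # data_to_json.py trims the trailing newline with fin.read()[:-1]
    f.write(payload + '\n')

with open('tests/data/500.test.out', 'w') as f:
    # test_regressions compares clean(payload) against this, after strip()
    f.write(bleach.clean(payload) + '\n')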
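And for steps 2 and 4, once server.py above is running on localhost:8080, any HTTP client can POST a payload and read back the cleaned text; the handler only looks at the request body and Content-Length. A small client sketch, using the Python 3 standard library purely for illustration, with the expected output hedged in the comment:

# sketch: POST a payload to the demo server from server.py (assumes it is already running)
from urllib.request import Request, urlopen

payload = '<script>evil()</script>'.encode('utf-8')
req = Request('http://localhost:8080/', data=payload, method='POST')

with urlopen(req) as resp:
    print(resp.read().decode('utf-8'))
# expected: &lt;script&gt;evil()&lt;/script&gt;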