pax_global_header00006660000000000000000000000064126310343070014511gustar00rootroot0000000000000052 comment=11b8160e584470439c8c0b3ab51012c9300f6788 python-bleach-1.4.2/000077500000000000000000000000001263103430700142525ustar00rootroot00000000000000python-bleach-1.4.2/.gitignore000066400000000000000000000001061263103430700162370ustar00rootroot00000000000000*.pyo *.pyc pip-log.txt .coverage dist *.egg-info .noseids build .tox python-bleach-1.4.2/.travis.yml000066400000000000000000000002631263103430700163640ustar00rootroot00000000000000sudo: false language: python python: - "2.6" - "2.7" - "3.2" - "3.3" - "3.4" - "pypy" install: - "pip install -r requirements.txt" script: - nosetests - flake8 bleach/ python-bleach-1.4.2/CHANGES000066400000000000000000000052451263103430700152530ustar00rootroot00000000000000Bleach Changes ============== Version 1.4.2 ------------- - Fix hang in linkify with parse_email=True. #124 - Fix crash in linkify when removing a link that is a first-child. #136 - Updated TLDs. - Don't remove exterior brackets when linkifying. #146 Version 1.4.1 ------------- - Consistent order of attributes in output. - Python 3.4. Version 1.4 ----------- - Update linkify to use etree type Treewalker instead of simpletree. - Updated html5lib to version >= 0.999. - Update all code to be compatible with Python 3 and 2 using six. - Switch to Apache License. Version 1.3 ----------- - Used by Python 3-only fork. Version 1.2.2 ------------- - Pin html5lib to version 0.95 for now due to major API break. Version 1.2.1 ------------- - clean() no longer considers "feed:" an acceptable protocol due to inconsistencies in browser behavior. Version 1.2 ----------- - linkify() has changed considerably. Many keyword arguments have been replaced with a single callbacks list. Please see the documentation for more information. - Bleach will no longer consider unacceptable protocols when linkifying. - linkify() now takes a tokenizer argument that allows it to skip sanitization. - delinkify() is gone. - Removed exception handling from _render. clean() and linkify() may now throw. - linkify() correctly ignores case for protocols and domain names. - linkify() correctly handles markup within an tag. Version 1.1.3 ------------- - Fix parsing bare URLs when parse_email=True. Version 1.1.2 ------------- - Fix hang in style attribute sanitizer. (#61) - Allow '/' in style attribute values. Version 1.1.1 ------------- - Fix tokenizer for html5lib 0.9.5. Version 1.1.0 ------------- - linkify() now understands port numbers. (#38) - Documented character encoding behavior. (#41) - Add an optional target argument to linkify(). - Add delinkify() method. (#45) - Support subdomain whitelist for delinkify(). (#47, #48) Version 1.0.4 ------------- - Switch to SemVer git tags. - Make linkify() smarter about trailing punctuation. (#30) - Pass exc_info to logger during rendering issues. - Add wildcard key for attributes. (#19) - Make linkify() use the HTMLSanitizer tokenizer. (#36) - Fix URLs wrapped in parentheses. (#23) - Make linkify() UTF-8 safe. (#33) Version 1.0.3 ------------- - linkify() works with 3rd level domains. (#24) - clean() supports vendor prefixes in style values. (#31, #32) - Fix linkify() email escaping. Version 1.0.2 ------------- - linkify() supports email addresses. - clean() supports callables in attributes filter. Version 1.0.1 ------------- - linkify() doesn't drop trailing slashes. (#21) - linkify() won't linkify 'libgl.so.1'. 
(#22) python-bleach-1.4.2/CONTRIBUTING.rst000066400000000000000000000004701263103430700167140ustar00rootroot00000000000000Reporting Security Issues ========================= If you believe you have found an exploit in a patched version of Bleach, master or the latest released version on PyPI, **please do not post it in a GitHub issue**. Please contact me privately, at `me+bleach@jamessocol.com `. python-bleach-1.4.2/CONTRIBUTORS000066400000000000000000000007021263103430700161310ustar00rootroot00000000000000Bleach is written and maintained by James Socol and various contributors within and without the Mozilla Corporation and Foundation. Lead Developer: - James Socol Contributors: - Jeff Balogh - Ricky Rosario - Chris Beaven - Luis Nell Patches: - Les Orchard - Paul Craciunoiu - Sébastien Fievet - TimothyFitz - Adrian "ThiefMaster" - Adam Lofts - Anton Kovalyov - Mark Paschal - Alex Ehlke - Marc DM - mdxs - Marc Abramowitz python-bleach-1.4.2/LICENSE000066400000000000000000000010641263103430700152600ustar00rootroot00000000000000Copyright (c) 2014, Mozilla Foundation Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. python-bleach-1.4.2/MANIFEST.in000066400000000000000000000000431263103430700160050ustar00rootroot00000000000000include LICENSE include README.rst python-bleach-1.4.2/README.rst000066400000000000000000000050401263103430700157400ustar00rootroot00000000000000====== Bleach ====== .. image:: https://travis-ci.org/jsocol/bleach.png?branch=master :target: https://travis-ci.org/jsocol/bleach .. image:: https://badge.fury.io/py/Bleach.svg :target: http://badge.fury.io/py/Bleach Bleach is an HTML sanitizing library that escapes or strips markup and attributes based on a white list. Bleach can also linkify text safely, applying filters that Django's ``urlize`` filter cannot, and optionally setting ``rel`` attributes, even on links already in the text. Bleach is intended for sanitizing text from *untrusted* sources. If you find yourself jumping through hoops to allow your site administrators to do lots of things, you're probably outside the use cases. Either trust those users, or don't. Because it relies on html5lib_, Bleach is as good as modern browsers at dealing with weird, quirky HTML fragments. And *any* of Bleach's methods will fix unbalanced or mis-nested tags. The version on GitHub_ is the most up-to-date and contains the latest bug fixes. You can find full documentation on `ReadTheDocs`_. Reporting Security Issues ========================= If you believe you have found an exploit in a patched version of Bleach, master or the latest released version on PyPI, **please do not post it in a GitHub issue**. Please contact me privately, at `me+bleach@jamessocol.com `. Basic Use ========= The simplest way to use Bleach is: .. 
code-block:: python >>> import bleach >>> bleach.clean('an example') u'an <script>evil()</script> example' >>> bleach.linkify('an http://example.com url') u'an http://example.com url *NB*: Bleach always returns a ``unicode`` object, whether you give it a bytestring or a ``unicode`` object, but Bleach does not attempt to detect incoming character encodings, and will assume UTF-8. If you are using a different character encoding, you should convert from a bytestring to ``unicode`` before passing the text to Bleach. Installation ------------ Bleach is available on PyPI_, so you can install it with ``pip``:: $ pip install bleach Or with ``easy_install``:: $ easy_install bleach Or by cloning the repo from GitHub_:: $ git clone git://github.com/jsocol/bleach.git Then install it by running:: $ python setup.py install .. _html5lib: https://github.com/html5lib/html5lib-python .. _GitHub: https://github.com/jsocol/bleach .. _ReadTheDocs: http://bleach.readthedocs.org/ .. _PyPI: http://pypi.python.org/pypi/bleach python-bleach-1.4.2/bleach/000077500000000000000000000000001263103430700154705ustar00rootroot00000000000000python-bleach-1.4.2/bleach/__init__.py000066400000000000000000000317661263103430700176160ustar00rootroot00000000000000# -*- coding: utf-8 -*- from __future__ import unicode_literals import logging import re import html5lib from html5lib.sanitizer import HTMLSanitizer from html5lib.serializer.htmlserializer import HTMLSerializer from . import callbacks as linkify_callbacks from .encoding import force_unicode from .sanitizer import BleachSanitizer VERSION = (1, 4, 2) __version__ = '.'.join([str(n) for n in VERSION]) __all__ = ['clean', 'linkify'] log = logging.getLogger('bleach') ALLOWED_TAGS = [ 'a', 'abbr', 'acronym', 'b', 'blockquote', 'code', 'em', 'i', 'li', 'ol', 'strong', 'ul', ] ALLOWED_ATTRIBUTES = { 'a': ['href', 'title'], 'abbr': ['title'], 'acronym': ['title'], } ALLOWED_STYLES = [] TLDS = """ac ad ae aero af ag ai al am an ao aq ar arpa as asia at au aw ax az ba bb bd be bf bg bh bi biz bj bm bn bo br bs bt bv bw by bz ca cat cc cd cf cg ch ci ck cl cm cn co com coop cr cu cv cx cy cz de dj dk dm do dz ec edu ee eg er es et eu fi fj fk fm fo fr ga gb gd ge gf gg gh gi gl gm gn gov gp gq gr gs gt gu gw gy hk hm hn hr ht hu id ie il im in info int io iq ir is it je jm jo jobs jp ke kg kh ki km kn kp kr kw ky kz la lb lc li lk lr ls lt lu lv ly ma mc md me mg mh mil mk ml mm mn mo mobi mp mq mr ms mt mu museum mv mw mx my mz na name nc ne net nf ng ni nl no np nr nu nz om org pa pe pf pg ph pk pl pm pn post pr pro ps pt pw py qa re ro rs ru rw sa sb sc sd se sg sh si sj sk sl sm sn so sr ss st su sv sx sy sz tc td tel tf tg th tj tk tl tm tn to tp tr travel tt tv tw tz ua ug uk us uy uz va vc ve vg vi vn vu wf ws xn xxx ye yt yu za zm zw""".split() # Make sure that .com doesn't get matched by .co first TLDS.reverse() PROTOCOLS = HTMLSanitizer.acceptable_protocols url_re = re.compile( r"""\(* # Match any opening parentheses. \b(?"]*)? # /path/zz (excluding "unsafe" chars from RFC 1738, # except for # and ~, which happen in practice) """.format('|'.join(PROTOCOLS), '|'.join(TLDS)), re.IGNORECASE | re.VERBOSE | re.UNICODE) proto_re = re.compile(r'^[\w-]+:/{0,3}', re.IGNORECASE) punct_re = re.compile(r'([\.,]+)$') email_re = re.compile( r"""(? tag replaced by the text within it adj = replace_nodes(tree, _text, node, current_child) current_child -= 1 # pull back current_child by 1 to scan the # new nodes again. 
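# Note (added comment): in the 'else' branch below the callbacks have kept the
# link, so the (possibly modified) attributes are written back onto the <a>
# element and its text content is rebuilt from the returned '_text' value.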
else: text = force_unicode(attrs.pop('_text')) for attr_key, attr_val in attrs.items(): node.set(attr_key, attr_val) for n in reversed(list(node)): node.remove(n) text = parser.parseFragment(text) node.text = text.text for n in text: node.append(n) _seen.add(node) elif current_child >= 0: if node.tag == ETREE_TAG('pre') and skip_pre: linkify_nodes(node, False) elif not (node in _seen): linkify_nodes(node, True) current_child += 1 def email_repl(match): addr = match.group(0).replace('"', '"') link = { '_text': addr, 'href': 'mailto:{0!s}'.format(addr), } link = apply_callbacks(link, True) if link is None: return addr _href = link.pop('href') _text = link.pop('_text') repl = '{2!s}' attr = '{0!s}="{1!s}"' attribs = ' '.join(attr.format(k, v) for k, v in link.items()) return repl.format(_href, attribs, _text) def link_repl(match): url = match.group(0) open_brackets = close_brackets = 0 if url.startswith('('): _wrapping = strip_wrapping_parentheses(url) url, open_brackets, close_brackets = _wrapping end = '' m = re.search(punct_re, url) if m: end = m.group(0) url = url[0:m.start()] if re.search(proto_re, url): href = url else: href = ''.join(['http://', url]) link = { '_text': url, 'href': href, } link = apply_callbacks(link, True) if link is None: return '(' * open_brackets + url + ')' * close_brackets _text = link.pop('_text') _href = link.pop('href') repl = '{0!s}{3!s}{4!s}{5!s}' attr = '{0!s}="{1!s}"' attribs = ' '.join(attr.format(k, v) for k, v in link.items()) return repl.format('(' * open_brackets, _href, attribs, _text, end, ')' * close_brackets) try: linkify_nodes(forest) except RuntimeError as e: # If we hit the max recursion depth, just return what we've got. log.exception('Probable recursion error: {0!r}'.format(e)) return _render(forest) def _render(tree): """Try rendering as HTML, then XML, then give up.""" return force_unicode(_serialize(tree)) def _serialize(domtree): walker = html5lib.treewalkers.getTreeWalker('etree') stream = walker(domtree) serializer = HTMLSerializer(quote_attr_values=True, alphabetical_attributes=True, omit_optional_tags=False) return serializer.render(stream) python-bleach-1.4.2/bleach/callbacks.py000066400000000000000000000010261263103430700177600ustar00rootroot00000000000000"""A set of basic callbacks for bleach.linkify.""" from __future__ import unicode_literals def nofollow(attrs, new=False): if attrs['href'].startswith('mailto:'): return attrs rel = [x for x in attrs.get('rel', '').split(' ') if x] if 'nofollow' not in [x.lower() for x in rel]: rel.append('nofollow') attrs['rel'] = ' '.join(rel) return attrs def target_blank(attrs, new=False): if attrs['href'].startswith('mailto:'): return attrs attrs['target'] = '_blank' return attrs python-bleach-1.4.2/bleach/encoding.py000066400000000000000000000043451263103430700176360ustar00rootroot00000000000000import datetime from decimal import Decimal import types import six def is_protected_type(obj): """Determine if the object instance is of a protected type. Objects of protected types are preserved as-is when passed to force_unicode(strings_only=True). """ return isinstance(obj, ( six.integer_types + (types.NoneType, datetime.datetime, datetime.date, datetime.time, float, Decimal)) ) def force_unicode(s, encoding='utf-8', strings_only=False, errors='strict'): """ Similar to smart_text, except that lazy instances are resolved to strings, rather than kept as lazy objects. If strings_only is True, don't convert (some) non-string-like objects. 
""" # Handle the common case first, saves 30-40% when s is an instance of # six.text_type. This function gets called often in that setting. if isinstance(s, six.text_type): return s if strings_only and is_protected_type(s): return s try: if not isinstance(s, six.string_types): if hasattr(s, '__unicode__'): s = s.__unicode__() else: if six.PY3: if isinstance(s, bytes): s = six.text_type(s, encoding, errors) else: s = six.text_type(s) else: s = six.text_type(bytes(s), encoding, errors) else: # Note: We use .decode() here, instead of six.text_type(s, # encoding, errors), so that if s is a SafeBytes, it ends up being # a SafeText at the end. s = s.decode(encoding, errors) except UnicodeDecodeError as e: if not isinstance(s, Exception): raise UnicodeDecodeError(*e.args) else: # If we get to here, the caller has passed in an Exception # subclass populated with non-ASCII bytestring data without a # working unicode method. Try to handle this without raising a # further exception by individually forcing the exception args # to unicode. s = ' '.join([force_unicode(arg, encoding, strings_only, errors) for arg in s]) return s python-bleach-1.4.2/bleach/sanitizer.py000066400000000000000000000145231263103430700200570ustar00rootroot00000000000000from __future__ import unicode_literals import re from xml.sax.saxutils import escape, unescape from html5lib.constants import tokenTypes from html5lib.sanitizer import HTMLSanitizerMixin from html5lib.tokenizer import HTMLTokenizer PROTOS = HTMLSanitizerMixin.acceptable_protocols PROTOS.remove('feed') class BleachSanitizerMixin(HTMLSanitizerMixin): """Mixin to replace sanitize_token() and sanitize_css().""" allowed_svg_properties = [] def sanitize_token(self, token): """Sanitize a token either by HTML-encoding or dropping. Unlike HTMLSanitizerMixin.sanitize_token, allowed_attributes can be a dict of {'tag': ['attribute', 'pairs'], 'tag': callable}. Here callable is a function with two arguments of attribute name and value. It should return true of false. Also gives the option to strip tags instead of encoding. """ if (getattr(self, 'wildcard_attributes', None) is None and isinstance(self.allowed_attributes, dict)): self.wildcard_attributes = self.allowed_attributes.get('*', []) if token['type'] in (tokenTypes['StartTag'], tokenTypes['EndTag'], tokenTypes['EmptyTag']): if token['name'] in self.allowed_elements: if 'data' in token: if isinstance(self.allowed_attributes, dict): allowed_attributes = self.allowed_attributes.get( token['name'], []) if not callable(allowed_attributes): allowed_attributes += self.wildcard_attributes else: allowed_attributes = self.allowed_attributes attrs = dict([(name, val) for name, val in token['data'][::-1] if (allowed_attributes(name, val) if callable(allowed_attributes) else name in allowed_attributes)]) for attr in self.attr_val_is_uri: if attr not in attrs: continue val_unescaped = re.sub("[`\000-\040\177-\240\s]+", '', unescape(attrs[attr])).lower() # Remove replacement characters from unescaped # characters. 
val_unescaped = val_unescaped.replace("\ufffd", "") if (re.match(r'^[a-z0-9][-+.a-z0-9]*:', val_unescaped) and (val_unescaped.split(':')[0] not in self.allowed_protocols)): del attrs[attr] for attr in self.svg_attr_val_allows_ref: if attr in attrs: attrs[attr] = re.sub(r'url\s*\(\s*[^#\s][^)]+?\)', ' ', unescape(attrs[attr])) if (token['name'] in self.svg_allow_local_href and 'xlink:href' in attrs and re.search(r'^\s*[^#\s].*', attrs['xlink:href'])): del attrs['xlink:href'] if 'style' in attrs: attrs['style'] = self.sanitize_css(attrs['style']) token['data'] = [(name, val) for name, val in attrs.items()] return token elif self.strip_disallowed_elements: pass else: if token['type'] == tokenTypes['EndTag']: token['data'] = ''.format(token['name']) elif token['data']: attr = ' {0!s}="{1!s}"' attrs = ''.join([attr.format(k, escape(v)) for k, v in token['data']]) token['data'] = '<{0!s}{1!s}>'.format(token['name'], attrs) else: token['data'] = '<{0!s}>'.format(token['name']) if token['selfClosing']: token['data'] = token['data'][:-1] + '/>' token['type'] = tokenTypes['Characters'] del token["name"] return token elif token['type'] == tokenTypes['Comment']: if not self.strip_html_comments: return token else: return token def sanitize_css(self, style): """HTMLSanitizerMixin.sanitize_css replacement. HTMLSanitizerMixin.sanitize_css always whitelists background-*, border-*, margin-*, and padding-*. We only whitelist what's in the whitelist. """ # disallow urls style = re.compile('url\s*\(\s*[^\s)]+?\s*\)\s*').sub(' ', style) # gauntlet # TODO: Make sure this does what it's meant to - I *think* it wants to # validate style attribute contents. parts = style.split(';') gauntlet = re.compile("""^([-/:,#%.'"\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'""" """\s*|"[\s\w]+"|\([\d,%\.\s]+\))*$""") for part in parts: if not gauntlet.match(part): return '' if not re.match("^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$", style): return '' clean = [] for prop, value in re.findall('([-\w]+)\s*:\s*([^:;]*)', style): if not value: continue if prop.lower() in self.allowed_css_properties: clean.append(prop + ': ' + value + ';') elif prop.lower() in self.allowed_svg_properties: clean.append(prop + ': ' + value + ';') return ' '.join(clean) class BleachSanitizer(HTMLTokenizer, BleachSanitizerMixin): def __init__(self, stream, encoding=None, parseMeta=True, useChardet=True, lowercaseElementName=True, lowercaseAttrName=True, **kwargs): HTMLTokenizer.__init__(self, stream, encoding, parseMeta, useChardet, lowercaseElementName, lowercaseAttrName, **kwargs) def __iter__(self): for token in HTMLTokenizer.__iter__(self): token = self.sanitize_token(token) if token: yield token python-bleach-1.4.2/bleach/tests/000077500000000000000000000000001263103430700166325ustar00rootroot00000000000000python-bleach-1.4.2/bleach/tests/__init__.py000066400000000000000000000000001263103430700207310ustar00rootroot00000000000000python-bleach-1.4.2/bleach/tests/test_basics.py000066400000000000000000000134661263103430700215210ustar00rootroot00000000000000import six import html5lib from nose.tools import eq_ import bleach from bleach.tests.tools import in_ def test_empty(): eq_('', bleach.clean('')) def test_nbsp(): if six.PY3: expected = '\xa0test string\xa0' else: expected = six.u('\\xa0test string\\xa0') eq_(expected, bleach.clean(' test string ')) def test_comments_only(): comment = '' open_comment = ''.format(open_comment), bleach.clean(open_comment, strip_comments=False)) def test_with_comments(): html = 'Just text' eq_('Just text', bleach.clean(html)) eq_(html, 
bleach.clean(html, strip_comments=False)) def test_no_html(): eq_('no html string', bleach.clean('no html string')) def test_allowed_html(): eq_('an allowed tag', bleach.clean('an allowed tag')) eq_('another good tag', bleach.clean('another good tag')) def test_bad_html(): eq_('a fixed tag', bleach.clean('a fixed tag')) def test_function_arguments(): TAGS = ['span', 'br'] ATTRS = {'span': ['style']} eq_('a
test', bleach.clean('a
test', tags=TAGS, attributes=ATTRS)) def test_named_arguments(): ATTRS = {'a': ['rel', 'href']} s = ('xx.com', 'xx.com') eq_('xx.com', bleach.clean(s[0])) in_(s, bleach.clean(s[0], attributes=ATTRS)) def test_disallowed_html(): eq_('a <script>safe()</script> test', bleach.clean('a test')) eq_('a <style>body{}</style> test', bleach.clean('a test')) def test_bad_href(): eq_('no link', bleach.clean('no link')) def test_bare_entities(): eq_('an & entity', bleach.clean('an & entity')) eq_('an < entity', bleach.clean('an < entity')) eq_('tag < and entity', bleach.clean('tag < and entity')) eq_('&', bleach.clean('&')) def test_escaped_entities(): s = '<em>strong</em>' eq_(s, bleach.clean(s)) def test_serializer(): s = '
' eq_(s, bleach.clean(s, tags=['table'])) eq_('test
', bleach.linkify('test
')) eq_('

test

', bleach.clean('

test

', tags=['p'])) def test_no_href_links(): s = 'x' eq_(s, bleach.linkify(s)) def test_weird_strings(): s = 'with
html tags', bleach.clean('a test with html tags', strip=True)) eq_('a test with html tags', bleach.clean('a test with ' 'html tags', strip=True)) s = '

link text

' eq_('

link text

', bleach.clean(s, tags=['p'], strip=True)) s = '

multiply nested text

' eq_('

multiply nested text

', bleach.clean(s, tags=['p'], strip=True)) s = ('

' '

') eq_('

', bleach.clean(s, tags=['p', 'a'], strip=True)) def test_allowed_styles(): ATTR = ['style'] STYLE = ['color'] blank = '' s = '' eq_(blank, bleach.clean('', attributes=ATTR)) eq_(s, bleach.clean(s, attributes=ATTR, styles=STYLE)) eq_(s, bleach.clean('', attributes=ATTR, styles=STYLE)) def test_idempotent(): """Make sure that applying the filter twice doesn't change anything.""" dirty = 'invalid & < extra http://link.com' clean = bleach.clean(dirty) eq_(clean, bleach.clean(clean)) linked = bleach.linkify(dirty) eq_(linked, bleach.linkify(linked)) def test_rel_already_there(): """Make sure rel attribute is updated not replaced""" linked = ('Click ' 'here.') link_good = (('Click ' 'here.'), ('Click ' 'here.')) in_(link_good, bleach.linkify(linked)) in_(link_good, bleach.linkify(link_good[0])) def test_lowercase_html(): """We should output lowercase HTML.""" dirty = 'BAR' clean = 'BAR' eq_(clean, bleach.clean(dirty, attributes=['class'])) def test_wildcard_attributes(): ATTR = { '*': ['id'], 'img': ['src'], } TAG = ['img', 'em'] dirty = ('both can have ' '') clean = ('both can have ', 'both can have ') in_(clean, bleach.clean(dirty, tags=TAG, attributes=ATTR)) def test_sarcasm(): """Jokes should crash.""" dirty = 'Yeah right ' clean = 'Yeah right <sarcasm/>' eq_(clean, bleach.clean(dirty)) python-bleach-1.4.2/bleach/tests/test_css.py000066400000000000000000000102261263103430700210340ustar00rootroot00000000000000from functools import partial from nose.tools import eq_ from bleach import clean clean = partial(clean, tags=['p'], attributes=['style']) def test_allowed_css(): tests = ( ('font-family: Arial; color: red; float: left; ' 'background-color: red;', 'color: red;', ['color']), ('border: 1px solid blue; color: red; float: left;', 'color: red;', ['color']), ('border: 1px solid blue; color: red; float: left;', 'color: red; float: left;', ['color', 'float']), ('color: red; float: left; padding: 1em;', 'color: red; float: left;', ['color', 'float']), ('color: red; float: left; padding: 1em;', 'color: red;', ['color']), ('cursor: -moz-grab;', 'cursor: -moz-grab;', ['cursor']), ('color: hsl(30,100%,50%);', 'color: hsl(30,100%,50%);', ['color']), ('color: rgba(255,0,0,0.4);', 'color: rgba(255,0,0,0.4);', ['color']), ("text-overflow: ',' ellipsis;", "text-overflow: ',' ellipsis;", ['text-overflow']), ('text-overflow: "," ellipsis;', 'text-overflow: "," ellipsis;', ['text-overflow']), ('font-family: "Arial";', 'font-family: "Arial";', ['font-family']), ) p_single = '

bar

' p_double = "

bar

" def check(i, o, s): if '"' in i: eq_(p_double.format(o), clean(p_double.format(i), styles=s)) else: eq_(p_single.format(o), clean(p_single.format(i), styles=s)) for i, o, s in tests: yield check, i, o, s def test_valid_css(): """The sanitizer should fix missing CSS values.""" styles = ['color', 'float'] eq_('

foo

', clean('

foo

', styles=styles)) eq_('

foo

', clean('

foo

', styles=styles)) def test_style_hang(): """The sanitizer should not hang on any inline styles""" # TODO: Neaten this up. It's copypasta from MDN/Kuma to repro the bug style = ("""margin-top: 0px; margin-right: 0px; margin-bottom: 1.286em; """ """margin-left: 0px; padding-top: 15px; padding-right: 15px; """ """padding-bottom: 15px; padding-left: 15px; border-top-width: """ """1px; border-right-width: 1px; border-bottom-width: 1px; """ """border-left-width: 1px; border-top-style: dotted; """ """border-right-style: dotted; border-bottom-style: dotted; """ """border-left-style: dotted; border-top-color: rgb(203, 200, """ """185); border-right-color: rgb(203, 200, 185); """ """border-bottom-color: rgb(203, 200, 185); border-left-color: """ """rgb(203, 200, 185); background-image: initial; """ """background-attachment: initial; background-origin: initial; """ """background-clip: initial; background-color: """ """rgb(246, 246, 242); overflow-x: auto; overflow-y: auto; """ """font: normal normal normal 100%/normal 'Courier New', """ """'Andale Mono', monospace; background-position: initial """ """initial; background-repeat: initial initial;""") html = '

Hello world

'.format(style) styles = [ 'border', 'float', 'overflow', 'min-height', 'vertical-align', 'white-space', 'margin', 'margin-left', 'margin-top', 'margin-bottom', 'margin-right', 'padding', 'padding-left', 'padding-top', 'padding-bottom', 'padding-right', 'background', 'background-color', 'font', 'font-size', 'font-weight', 'text-align', 'text-transform', ] expected = ("""

""" """Hello world

""") result = clean(html, styles=styles) eq_(expected, result) python-bleach-1.4.2/bleach/tests/test_links.py000066400000000000000000000361601263103430700213710ustar00rootroot00000000000000try: from urllib.parse import quote_plus except ImportError: from urllib import quote_plus from html5lib.tokenizer import HTMLTokenizer from nose.tools import eq_ from bleach import linkify, url_re, DEFAULT_CALLBACKS as DC def test_url_re(): def no_match(s): match = url_re.search(s) if match: assert not match, 'matched {0!s}'.format(s[slice(*match.span())]) yield no_match, 'just what i am looking for...it' def test_empty(): eq_('', linkify('')) def test_simple_link(): eq_('a http://example.com' ' link', linkify('a http://example.com link')) eq_('a https://example.com' ' link', linkify('a https://example.com link')) eq_('a example.com link', linkify('a example.com link')) def test_trailing_slash(): eq_('http://examp.com/', linkify('http://examp.com/')) eq_('' 'http://example.com/foo/', linkify('http://example.com/foo/')) eq_('' 'http://example.com/foo/bar/', linkify('http://example.com/foo/bar/')) def test_mangle_link(): """We can muck with the href attribute of the link.""" def filter_url(attrs, new=False): quoted = quote_plus(attrs['href']) attrs['href'] = 'http://bouncer/?u={0!s}'.format(quoted) return attrs eq_('' 'http://example.com', linkify('http://example.com', DC + [filter_url])) def test_mangle_text(): """We can muck with the inner text of a link.""" def ft(attrs, new=False): attrs['_text'] = 'bar' return attrs eq_('bar bar', linkify('http://ex.mp foo', [ft])) def test_email_link(): tests = ( ('a james@example.com mailto', False, 'a james@example.com mailto'), ('a james@example.com.au mailto', False, 'a james@example.com.au mailto'), ('a james@example.com mailto', True, 'a james@example.com mailto'), ('aussie ' 'james@example.com.au mailto', True, 'aussie james@example.com.au mailto'), # This is kind of a pathological case. I guess we do our best here. ('email to ' 'james@example.com', True, 'email to james@example.com'), ('
' 'jinkyun@example.com', True, '
jinkyun@example.com'), ) def _check(o, p, i): eq_(o, linkify(i, parse_email=p)) for (o, p, i) in tests: yield _check, o, p, i def test_email_link_escaping(): tests = ( ('''''' '''"james"@example.com''', '"james"@example.com'), ('''''' '''"j'ames"@example.com''', '"j\'ames"@example.com'), ('''''' '''"ja>mes"@example.com''', '"ja>mes"@example.com'), ) def _check(o, i): eq_(o, linkify(i, parse_email=True)) for (o, i) in tests: yield _check, o, i def test_prevent_links(): """Returning None from any callback should remove links or prevent them from being created.""" def no_new_links(attrs, new=False): if new: return None return attrs def no_old_links(attrs, new=False): if not new: return None return attrs def noop(attrs, new=False): return attrs in_text = 'a ex.mp example' out_text = 'a ex.mp example' tests = ( ([noop], ('a ex.mp ' 'example'), 'noop'), ([no_new_links, noop], in_text, 'no new, noop'), ([noop, no_new_links], in_text, 'noop, no new'), ([no_old_links, noop], out_text, 'no old, noop'), ([noop, no_old_links], out_text, 'noop, no old'), ([no_old_links, no_new_links], 'a ex.mp example', 'no links'), ) def _check(cb, o, msg): eq_(o, linkify(in_text, cb), msg) for (cb, o, msg) in tests: yield _check, cb, o, msg def test_set_attrs(): """We can set random attributes on links.""" def set_attr(attrs, new=False): attrs['rev'] = 'canonical' return attrs eq_('ex.mp', linkify('ex.mp', [set_attr])) def test_only_proto_links(): """Only create links if there's a protocol.""" def only_proto(attrs, new=False): if new and not attrs['_text'].startswith(('http:', 'https:')): return None return attrs in_text = 'a ex.mp http://ex.mp bar' out_text = ('a ex.mp http://ex.mp ' 'bar') eq_(out_text, linkify(in_text, [only_proto])) def test_stop_email(): """Returning None should prevent a link from being created.""" def no_email(attrs, new=False): if attrs['href'].startswith('mailto:'): return None return attrs text = 'do not link james@example.com' eq_(text, linkify(text, parse_email=True, callbacks=[no_email])) def test_tlds(): eq_('example.com', linkify('example.com')) eq_('example.co', linkify('example.co')) eq_('example.co.uk', linkify('example.co.uk')) eq_('example.edu', linkify('example.edu')) eq_('example.xxx', linkify('example.xxx')) eq_('example.yyy', linkify('example.yyy')) eq_(' brie', linkify(' brie')) eq_('bit.ly/fun', linkify('bit.ly/fun')) def test_escaping(): eq_('< unrelated', linkify('< unrelated')) def test_nofollow_off(): eq_('example.com', linkify('example.com', [])) def test_link_in_html(): eq_('http://yy.com', linkify('http://yy.com')) eq_('http://xx.com' '', linkify('http://xx.com')) def test_links_https(): eq_('https://yy.com', linkify('https://yy.com')) def test_add_rel_nofollow(): """Verify that rel="nofollow" is added to an existing link""" eq_('http://yy.com', linkify('http://yy.com')) def test_url_with_path(): eq_('' 'http://example.com/path/to/file', linkify('http://example.com/path/to/file')) def test_link_ftp(): eq_('' 'ftp://ftp.mozilla.org/some/file', linkify('ftp://ftp.mozilla.org/some/file')) def test_link_query(): eq_('' 'http://xx.com/?test=win', linkify('http://xx.com/?test=win')) eq_('' 'xx.com/?test=win', linkify('xx.com/?test=win')) eq_('' 'xx.com?test=win', linkify('xx.com?test=win')) def test_link_fragment(): eq_('' 'http://xx.com/path#frag', linkify('http://xx.com/path#frag')) def test_link_entities(): eq_('' 'http://xx.com/?a=1&b=2', linkify('http://xx.com/?a=1&b=2')) def test_escaped_html(): """If I pass in escaped HTML, it should probably come out escaped.""" s = 
'<em>strong</em>' eq_(s, linkify(s)) def test_link_http_complete(): eq_('' 'https://user:pass@ftp.mozilla.org/x/y.exe?a=b&c=d&e#f', linkify('https://user:pass@ftp.mozilla.org/x/y.exe?a=b&c=d&e#f')) def test_non_url(): """document.vulnerable should absolutely not be linkified.""" s = 'document.vulnerable' eq_(s, linkify(s)) def test_javascript_url(): """javascript: urls should never be linkified.""" s = 'javascript:document.vulnerable' eq_(s, linkify(s)) def test_unsafe_url(): """Any unsafe char ({}[]<>, etc.) in the path should end URL scanning.""" eq_('All your{"xx.yy.com/grover.png"}base are', linkify('All your{"xx.yy.com/grover.png"}base are')) def test_skip_pre(): """Skip linkification in
 tags."""
    simple = 'http://xx.com 
http://xx.com
' linked = ('http://xx.com ' '
http://xx.com
') all_linked = ('http://xx.com ' '
http://xx.com'
                  '
') eq_(linked, linkify(simple, skip_pre=True)) eq_(all_linked, linkify(simple)) already_linked = '
xx
' nofollowed = '
xx
' eq_(nofollowed, linkify(already_linked)) eq_(nofollowed, linkify(already_linked, skip_pre=True)) def test_libgl(): """libgl.so.1 should not be linkified.""" eq_('libgl.so.1', linkify('libgl.so.1')) def test_end_of_sentence(): """example.com. should match.""" out = '{0!s}{1!s}' intxt = '{0!s}{1!s}' def check(u, p): eq_(out.format(u, p), linkify(intxt.format(u, p))) tests = ( ('example.com', '.'), ('example.com', '...'), ('ex.com/foo', '.'), ('ex.com/foo', '....'), ) for u, p in tests: yield check, u, p def test_end_of_clause(): """example.com/foo, shouldn't include the ,""" eq_('ex.com/foo, bar', linkify('ex.com/foo, bar')) def test_sarcasm(): """Jokes should crash.""" dirty = 'Yeah right ' clean = 'Yeah right <sarcasm/>' eq_(clean, linkify(dirty)) def test_wrapping_parentheses(): """URLs wrapped in parantheses should not include them.""" out = '{0!s}{2!s}{3!s}' tests = ( ('(example.com)', ('(', 'example.com', 'example.com', ')')), ('(example.com/)', ('(', 'example.com/', 'example.com/', ')')), ('(example.com/foo)', ('(', 'example.com/foo', 'example.com/foo', ')')), ('(((example.com/))))', ('(((', 'example.com/)', 'example.com/)', ')))')), ('example.com/))', ('', 'example.com/))', 'example.com/))', '')), ('http://en.wikipedia.org/wiki/Test_(assessment)', ('', 'en.wikipedia.org/wiki/Test_(assessment)', 'http://en.wikipedia.org/wiki/Test_(assessment)', '')), ('(http://en.wikipedia.org/wiki/Test_(assessment))', ('(', 'en.wikipedia.org/wiki/Test_(assessment)', 'http://en.wikipedia.org/wiki/Test_(assessment)', ')')), ('((http://en.wikipedia.org/wiki/Test_(assessment))', ('((', 'en.wikipedia.org/wiki/Test_(assessment', 'http://en.wikipedia.org/wiki/Test_(assessment', '))')), ('(http://en.wikipedia.org/wiki/Test_(assessment)))', ('(', 'en.wikipedia.org/wiki/Test_(assessment))', 'http://en.wikipedia.org/wiki/Test_(assessment))', ')')), ('(http://en.wikipedia.org/wiki/)Test_(assessment', ('(', 'en.wikipedia.org/wiki/)Test_(assessment', 'http://en.wikipedia.org/wiki/)Test_(assessment', '')), ) def check(test, expected_output): eq_(out.format(*expected_output), linkify(test)) for test, expected_output in tests: yield check, test, expected_output def test_parentheses_with_removing(): expect = '(test.py)' eq_(expect, linkify(expect, callbacks=[lambda *a: None])) def test_ports(): """URLs can contain port numbers.""" tests = ( ('http://foo.com:8000', ('http://foo.com:8000', '')), ('http://foo.com:8000/', ('http://foo.com:8000/', '')), ('http://bar.com:xkcd', ('http://bar.com', ':xkcd')), ('http://foo.com:81/bar', ('http://foo.com:81/bar', '')), ('http://foo.com:', ('http://foo.com', ':')), ) def check(test, output): out = '{0}{1}' eq_(out.format(*output), linkify(test)) for test, output in tests: yield check, test, output def test_tokenizer(): """Linkify doesn't always have to sanitize.""" raw = 'test' eq_('test<x></x>', linkify(raw)) eq_(raw, linkify(raw, tokenizer=HTMLTokenizer)) def test_ignore_bad_protocols(): eq_('foohttp://bar', linkify('foohttp://bar')) eq_('fohttp://exampl.com', linkify('fohttp://exampl.com')) def test_max_recursion_depth(): """If we hit the max recursion depth, just return the string.""" test = '' * 2000 + 'foo' + '' * 2000 eq_(test, linkify(test)) def test_link_emails_and_urls(): """parse_email=True shouldn't prevent URLs from getting linkified.""" output = ('' 'http://example.com ' 'person@example.com') eq_(output, linkify('http://example.com person@example.com', parse_email=True)) def test_links_case_insensitive(): """Protocols and domain names are case insensitive.""" 
expect = ('' 'HTTP://EXAMPLE.COM') eq_(expect, linkify('HTTP://EXAMPLE.COM')) def test_elements_inside_links(): eq_('hello
', linkify('hello
')) eq_('bold hello
', linkify('bold hello
')) def test_remove_first_childlink(): expect = '

something

' callbacks = [lambda *a: None] eq_(expect, linkify('

something

', callbacks=callbacks)) python-bleach-1.4.2/bleach/tests/test_security.py000066400000000000000000000073221263103430700221160ustar00rootroot00000000000000"""More advanced security tests""" from nose.tools import eq_ from bleach import clean def test_nested_script_tag(): eq_('<<script>script>evil()<</script>/script>', clean('</script>')) eq_('<<x>script>evil()<</x>/script>', clean('<script>evil()</script>')) def test_nested_script_tag_r(): eq_('<script<script>>evil()</script<>>', clean('>evil()>')) def test_invalid_attr(): IMG = ['img', ] IMG_ATTR = ['src'] eq_('test', clean('test')) eq_('', clean('', tags=IMG, attributes=IMG_ATTR)) eq_('', clean('', tags=IMG, attributes=IMG_ATTR)) def test_unquoted_attr(): eq_('myabbr', clean('myabbr')) def test_unquoted_event_handler(): eq_('xx.com', clean('xx.com')) def test_invalid_attr_value(): eq_('<img src="javascript:alert(\'XSS\');">', clean('')) def test_invalid_href_attr(): eq_('xss', clean('xss')) def test_invalid_filter_attr(): IMG = ['img', ] IMG_ATTR = {'img': lambda n, v: n == 'src' and v == "http://example.com/"} eq_('', clean('', tags=IMG, attributes=IMG_ATTR)) eq_('', clean('', tags=IMG, attributes=IMG_ATTR)) def test_invalid_tag_char(): eq_('<script xss="" src="http://xx.com/xss.js"></script>', clean('')) eq_('<script src="http://xx.com/xss.js"></script>', clean('')) def test_unclosed_tag(): eq_('<script src="http://xx.com/xss.js&lt;b">', clean('ipt>' eq_('pt>alert(1)ipt>', clean(s, strip=True)) s = 'pt>pt>alert(1)' eq_('pt>pt>alert(1)', clean(s, strip=True)) def test_nasty(): """Nested, broken up, multiple tags, are still foiled!""" test = ('ipt type="text/javascript">alert("foo");script>') expect = ('<scr<script></script>ipt type="text/javascript"' '>alert("foo");</script>script<del></del>' '>') eq_(expect, clean(test)) def test_poster_attribute(): """Poster attributes should not allow javascript.""" tags = ['video'] attrs = {'video': ['poster']} test = '' expect = '' eq_(expect, clean(test, tags=tags, attributes=attrs)) ok = '' eq_(ok, clean(ok, tags=tags, attributes=attrs)) def test_feed_protocol(): eq_('foo', clean('foo')) python-bleach-1.4.2/bleach/tests/test_unicode.py000066400000000000000000000041501263103430700216710ustar00rootroot00000000000000# -*- coding: utf-8 -*- from __future__ import unicode_literals from nose.tools import eq_ from bleach import clean, linkify from bleach.tests.tools import in_ def test_japanese_safe_simple(): eq_('ヘルプとチュートリアル', clean('ヘルプとチュートリアル')) eq_('ヘルプとチュートリアル', linkify('ヘルプとチュートリアル')) def test_japanese_strip(): eq_('ヘルプとチュートリアル', clean('ヘルプとチュートリアル')) eq_('<span>ヘルプとチュートリアル</span>', clean('ヘルプとチュートリアル')) def test_russian_simple(): eq_('Домашняя', clean('Домашняя')) eq_('Домашняя', linkify('Домашняя')) def test_mixed(): eq_('Домашняяヘルプとチュートリアル', clean('Домашняяヘルプとチュートリアル')) def test_mixed_linkify(): in_(('Домашняя ' 'http://example.com ヘルプとチュートリアル', 'Домашняя ' 'http://example.com ヘルプとチュートリアル'), linkify('Домашняя http://example.com ヘルプとチュートリアル')) def test_url_utf8(): """Allow UTF8 characters in URLs themselves.""" outs = ('{0!s}', '{0!s}') out = lambda url: [x.format(url) for x in outs] tests = ( ('http://éxámplé.com/', out('http://éxámplé.com/')), ('http://éxámplé.com/íàñá/', out('http://éxámplé.com/íàñá/')), ('http://éxámplé.com/íàñá/?foo=bar', out('http://éxámplé.com/íàñá/?foo=bar')), ('http://éxámplé.com/íàñá/?fóo=bár', out('http://éxámplé.com/íàñá/?fóo=bár')), ) def check(test, expected_output): in_(expected_output, linkify(test)) for test, expected_output in tests: yield check, test, 
expected_output python-bleach-1.4.2/bleach/tests/tools.py000066400000000000000000000002601263103430700203420ustar00rootroot00000000000000 def in_(l, a, msg=None): """Shorthand for 'assert a in l, "%r not in %r" % (a, l) """ if a not in l: raise AssertionError(msg or "%r not in %r" % (a, l)) python-bleach-1.4.2/docs/000077500000000000000000000000001263103430700152025ustar00rootroot00000000000000python-bleach-1.4.2/docs/Makefile000066400000000000000000000126741263103430700166540ustar00rootroot00000000000000# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: -rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." 
qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/Bleach.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/Bleach.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/Bleach" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/Bleach" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." python-bleach-1.4.2/docs/clean.rst000066400000000000000000000065121263103430700170220ustar00rootroot00000000000000.. _clean-chapter: .. highlightlang:: python ================== ``bleach.clean()`` ================== ``clean()`` is Bleach's HTML sanitization method:: def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES, styles=ALLOWED_STYLES, strip=False, strip_comments=True): """Clean an HTML fragment and return it.""" Given a fragment of HTML, Bleach will parse it according to the HTML5 parsing algorithm and sanitize any disallowed tags or attributes. This algorithm also takes care of things like unclosed and (some) misnested tags. .. note:: You may pass in a ``string`` or a ``unicode`` object, but Bleach will always return ``unicode``. Tag Whitelist ============= The ``tags`` kwarg is a whitelist of allowed HTML tags. 
It should be a list, tuple, or other iterable. Any other HTML tags will be escaped or stripped from the text. Its default value is a relatively conservative list found in ``bleach.ALLOWED_TAGS``. Attribute Whitelist =================== The ``attributes`` kwarg is a whitelist of attributes. It can be a list, in which case the attributes are allowed for any tag, or a dictionary, in which case the keys are tag names (or a wildcard: ``*`` for all tags) and the values are lists of allowed attributes. For example:: attrs = { '*': ['class'], 'a': ['href', 'rel'], 'img': ['src', 'alt'], } In this case, ``class`` is allowed on any allowed element (from the ``tags`` argument), ```` tags are allowed to have ``href`` and ``rel`` attributes, and so on. The default value is also a conservative dict found in ``bleach.ALLOWED_ATTRIBUTES``. Callable Filters ---------------- You can also use a callable (instead of a list) in the ``attributes`` kwarg. If the callable returns ``True``, the attribute is allowed. Otherwise, it is stripped. For example:: def filter_src(name, value): if name in ('alt', 'height', 'width'): return True if name == 'src': p = urlparse(value) return (not p.netloc) or p.netloc == 'mydomain.com' return False attrs = { 'img': filter_src, } Styles Whitelist ================ If you allow the ``style`` attribute, you will also need to whitelist styles users are allowed to set, for example ``color`` and ``background-color``. The default value is an empty list, i.e., the ``style`` attribute will be allowed but no values will be. For example, to allow users to set the color and font-weight of text:: attrs = { '*': ['style'] } tags = ['p', 'em', 'strong'] styles = ['color', 'font-weight'] cleaned_text = bleach.clean(text, tags, attrs, styles) Stripping Markup ================ By default, Bleach *escapes* disallowed or invalid markup. For example:: >>> bleach.clean('is not allowed') u'<span>is not allowed</span> If you would rather Bleach stripped this markup entirely, you can pass ``strip=True``:: >>> bleach.clean('is not allowed', strip=True) u'is not allowed' Stripping Comments ================== By default, Bleach will strip out HTML comments. To disable this behavior, set ``strip_comments=False``:: >>> html = 'my html' >>> bleach.clean(html) u'my html' >>> bleach.clean(html, strip_comments=False) u'my html' python-bleach-1.4.2/docs/conf.py000066400000000000000000000171441263103430700165100ustar00rootroot00000000000000# -*- coding: utf-8 -*- # # Bleach documentation build configuration file, created by # sphinx-quickstart on Fri May 11 21:11:39 2012. # # This file is execfile()d with the current directory set to its containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import sys, os # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. #sys.path.insert(0, os.path.abspath('.')) # -- General configuration ----------------------------------------------------- # If your documentation needs a minimal Sphinx version, state it here. #needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be extensions # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. 
extensions = ['sphinx.ext.autodoc', 'sphinx.ext.pngmath', 'sphinx.ext.viewcode'] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8-sig' # The master toctree document. master_doc = 'index' # General information about the project. project = u'Bleach' copyright = u'2012-2104, James Socol' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '1.4' # The full version, including alpha/beta/rc tags. release = '1.4.1' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = ['_build'] # The reST default role (used for this markup: `text`) to use for all documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. #add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # -- Options for HTML output --------------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. html_theme = 'default' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ['_static'] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. #html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. 
#html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_domain_indices = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. #html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = None # Output file base name for HTML help builder. htmlhelp_basename = 'Bleachdoc' # -- Options for LaTeX output -------------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). #'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, documentclass [howto/manual]). latex_documents = [ ('index', 'Bleach.tex', u'Bleach Documentation', u'James Socol', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. #latex_use_parts = False # If true, show page references after internal links. #latex_show_pagerefs = False # If true, show URL addresses after external links. #latex_show_urls = False # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_domain_indices = True # -- Options for manual page output -------------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'bleach', u'Bleach Documentation', [u'James Socol'], 1) ] # If true, show URL addresses after external links. #man_show_urls = False # -- Options for Texinfo output ------------------------------------------------ # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'Bleach', u'Bleach Documentation', u'James Socol', 'Bleach', 'One line description of project.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] # If false, no module index is generated. #texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. #texinfo_show_urls = 'footnote' python-bleach-1.4.2/docs/goals.rst000066400000000000000000000053111263103430700170410ustar00rootroot00000000000000=============== Goals of Bleach =============== This document lists the goals and non-goals of Bleach. My hope is that by focusing on these goals and explicitly listing the non-goals, the project will evolve in a stronger direction. 
Goals
=====

Whitelisting
------------

Bleach should always take a whitelist-based approach to allowing any kind of
content or markup. Blacklisting is error-prone and not future-proof.

For example, you should have to opt in to allowing the ``onclick`` attribute,
not blacklist all the other ``on*`` attributes. Future versions of HTML may add
new event handlers, like ``ontouch``, that old blacklists would not prevent.

Sanitizing Input
----------------

The primary goal of Bleach is to sanitize user input that is allowed to contain
*some* HTML as markup and is to be included in the content of a larger page.

Examples might include:

* User comments on a blog.
* "Bio" sections of a user profile.
* Descriptions of a product or application.

These examples, and others, are traditionally prone to security issues like XSS
or other script injection, or annoying issues like unclosed tags and invalid
markup.

Bleach will take a proactive, whitelist-only approach to allowing HTML content,
and will use the HTML5 parsing algorithm to handle invalid markup. See the
:ref:`chapter on clean() <clean-chapter>` for more info.

Safely Creating Links
---------------------

The secondary goal of Bleach is to provide a mechanism for finding or altering
links (``<a>`` tags with ``href`` attributes, or things that look like URLs or
email addresses) in text.

While Bleach itself will always operate on a whitelist-based security model,
the :ref:`linkify() method <linkify-chapter>` is flexible enough to allow the
creation, alteration, and removal of links based on an extremely wide range of
use cases.

Non-Goals
=========

Bleach is designed to work with fragments of HTML from untrusted users. Some
non-goal use cases include:

* **Sanitizing complete HTML documents.** Once you're creating whole documents,
  you have to allow so many tags that a blacklist approach (e.g. forbidding ``