bleach-1.2.2/CHANGES

Bleach Changes
==============

Version 1.2.1
-------------

- clean() no longer considers "feed:" an acceptable protocol due to
  inconsistencies in browser behavior.


Version 1.2
-----------

- linkify() has changed considerably. Many keyword arguments have been
  replaced with a single callbacks list. Please see the documentation for
  more information.
- Bleach will no longer consider unacceptable protocols when linkifying.
- linkify() now takes a tokenizer argument that allows it to skip
  sanitization.
- delinkify() is gone.
- Removed exception handling from _render. clean() and linkify() may now
  throw.
- linkify() correctly ignores case for protocols and domain names.
- linkify() correctly handles markup within an <a> tag.


Version 1.1.3
-------------

- Fix parsing bare URLs when parse_email=True.


Version 1.1.2
-------------

- Fix hang in style attribute sanitizer. (#61)
- Allow '/' in style attribute values.


Version 1.1.1
-------------

- Fix tokenizer for html5lib 0.9.5.


Version 1.1.0
-------------

- linkify() now understands port numbers. (#38)
- Documented character encoding behavior. (#41)
- Add an optional target argument to linkify().
- Add delinkify() method. (#45)
- Support subdomain whitelist for delinkify(). (#47, #48)


Version 1.0.4
-------------

- Switch to SemVer git tags.
- Make linkify() smarter about trailing punctuation. (#30)
- Pass exc_info to logger during rendering issues.
- Add wildcard key for attributes. (#19)
- Make linkify() use the HTMLSanitizer tokenizer. (#36)
- Fix URLs wrapped in parentheses. (#23)
- Make linkify() UTF-8 safe. (#33)


Version 1.0.3
-------------

- linkify() works with 3rd level domains. (#24)
- clean() supports vendor prefixes in style values. (#31, #32)
- Fix linkify() email escaping.


Version 1.0.2
-------------

- linkify() supports email addresses.
- clean() supports callables in attributes filter.


Version 1.0.1
-------------

- linkify() doesn't drop trailing slashes. (#21)
- linkify() won't linkify 'libgl.so.1'. (#22)


bleach-1.2.2/CONTRIBUTORS

Bleach is written and maintained by James Socol and various contributors
within and without the Mozilla Corporation and Foundation.

Lead Developer:

- James Socol

Contributors:

- Jeff Balogh
- Ricky Rosario
- Chris Beaven
- Luis Nell

Patches:

- Les Orchard
- Paul Craciunoiu
- Sébastien Fievet
- TimothyFitz
- Adrian "ThiefMaster"
- Adam Lofts
- Anton Kovalyov
- Mark Paschal
- Alex Ehlke


bleach-1.2.2/LICENSE

Copyright (c) 2010, Mozilla Foundation
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice,
   this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

3. Neither the name of bleach nor the names of its contributors may be used
   to endorse or promote products derived from this software without specific
   prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.


bleach-1.2.2/setup.py

from setuptools import setup, find_packages

setup(
    name='bleach',
    version='1.2.2',
    description='An easy whitelist-based HTML-sanitizing tool.',
    long_description=open('README.rst').read(),
    author='James Socol',
    author_email='james@mozilla.com',
    url='http://github.com/jsocol/bleach',
    license='BSD',
    packages=find_packages(),
    include_package_data=True,
    package_data={'': ['README.rst']},
    zip_safe=False,
    install_requires=['html5lib==0.95'],
    classifiers=[
        'Development Status :: 4 - Beta',
        'Environment :: Web Environment',
        'Environment :: Web Environment :: Mozilla',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: BSD License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Topic :: Software Development :: Libraries :: Python Modules',
    ]
)


bleach-1.2.2/docs/goals.rst

===============
Goals of Bleach
===============

This document lists the goals and non-goals of Bleach. My hope is that by
focusing on these goals and explicitly listing the non-goals, the project
will evolve in a stronger direction.


Goals
=====


Whitelisting
------------

Bleach should always take a whitelist-based approach to allowing any kind of
content or markup. Blacklisting is error-prone and not future proof.

For example, you should have to opt in to allowing the ``onclick`` attribute,
not blacklist all the other ``on*`` attributes. Future versions of HTML may
add new event handlers, like ``ontouch``, that old blacklists would not
prevent.
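For illustration, here is a sketch of what that opt-in looks like with the
public ``clean()`` API (the tag and attribute whitelists below are examples
chosen for this document, not Bleach's defaults)::

    >>> import bleach

    >>> bleach.clean('<abbr title="cheese">brie</abbr> <b onclick="evil()">b</b>',
    ...              tags=['abbr', 'b'], attributes={'abbr': ['title']})
    u'<abbr title="cheese">brie</abbr> <b>b</b>'

Nothing outside the whitelists survives: the ``onclick`` handler is dropped
because it was never opted into, and any tag not listed would be escaped.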
Sanitizing Input
----------------

The primary goal of Bleach is to sanitize user input that is allowed to
contain *some* HTML as markup and is to be included in the content of a
larger page. Examples might include:

* User comments on a blog.
* "Bio" sections of a user profile.
* Descriptions of a product or application.

These examples, and others, are traditionally prone to security issues like
XSS or other script injection, or annoying issues like unclosed tags and
invalid markup. Bleach will take a proactive, whitelist-only approach to
allowing HTML content, and will use the HTML5 parsing algorithm to handle
invalid markup.

See the :ref:`chapter on clean() <chapter-clean>` for more info.


Safely Creating Links
---------------------

The secondary goal of Bleach is to provide a mechanism for finding or
altering links (``<a>`` tags with ``href`` attributes, or things that look
like URLs or email addresses) in text.

While Bleach itself will always operate on a whitelist-based security model,
the :ref:`linkify() method <chapter-linkify>` is flexible enough to allow the
creation, alteration, and removal of links based on an extremely wide range
of use cases.
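As a sketch of that flexibility: a linkify callback receives the link's
attributes (plus the pseudo-attribute ``_text``) and may alter them or return
``None`` to veto the link entirely. ``nofollow`` and ``target_blank`` below
ship in ``bleach.callbacks``; the attribute order in the serialized link may
vary::

    >>> import bleach
    >>> from bleach.callbacks import nofollow, target_blank

    >>> bleach.linkify('go to http://example.com',
    ...                callbacks=[nofollow, target_blank])
    u'go to <a href="http://example.com" rel="nofollow" target="_blank">http://example.com</a>'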
Non-Goals
=========

Bleach is designed to work with fragments of HTML by untrusted users. Some
non-goal use cases include:

* **Sanitizing complete HTML documents.** Once you're creating whole
  documents, you have to allow so many tags that a blacklist approach (e.g.
  forbidding ``<script>`` or ``<object>``) may be more appropriate.


bleach-1.2.2/docs/index.rst

The simplest way to use Bleach is::

    >>> import bleach

    >>> bleach.clean('an <script>evil()</script> example')
    u'an &lt;script&gt;evil()&lt;/script&gt; example'

    >>> bleach.linkify('an http://example.com url')
    u'an <a href="http://example.com" rel="nofollow">http://example.com</a> url'

*NB*: Bleach always returns a ``unicode`` object, whether you give it a
bytestring or a ``unicode`` object, but Bleach does not attempt to detect
incoming character encodings, and will assume UTF-8. If you are using a
different character encoding, you should convert from a bytestring to
``unicode`` before passing the text to Bleach.


Installation
------------

Bleach is available on PyPI_, so you can install it with ``pip``::

    $ pip install bleach

Or with ``easy_install``::

    $ easy_install bleach

Or by cloning the repo from GitHub_::

    $ git clone git://github.com/jsocol/bleach.git

Then install it by running::

    $ python setup.py install

.. _html5lib: http://code.google.com/p/html5lib/
.. _GitHub: https://github.com/jsocol/bleach
.. _ReadTheDocs: http://bleach.readthedocs.org/
.. _PyPI: http://pypi.python.org/pypi/bleach


bleach-1.2.2/bleach/encoding.py

import datetime
from decimal import Decimal
import types


def is_protected_type(obj):
    """Determine if the object instance is of a protected type.

    Objects of protected types are preserved as-is when passed to
    force_unicode(strings_only=True).
    """
    return isinstance(obj, (
        types.NoneType,
        int, long,
        datetime.datetime, datetime.date, datetime.time,
        float, Decimal)
    )


def force_unicode(s, encoding='utf-8', strings_only=False, errors='strict'):
    """
    Similar to smart_unicode, except that lazy instances are resolved to
    strings, rather than kept as lazy objects.

    If strings_only is True, don't convert (some) non-string-like objects.
    """
    if strings_only and is_protected_type(s):
        return s
    try:
        if not isinstance(s, basestring,):
            if hasattr(s, '__unicode__'):
                s = unicode(s)
            else:
                try:
                    s = unicode(str(s), encoding, errors)
                except UnicodeEncodeError:
                    if not isinstance(s, Exception):
                        raise
                    # If we get to here, the caller has passed in an Exception
                    # subclass populated with non-ASCII data without special
                    # handling to display as a string. We need to handle this
                    # without raising a further exception. We do an
                    # approximation to what the Exception's standard str()
                    # output should be.
                    s = ' '.join([force_unicode(arg, encoding, strings_only,
                                                errors) for arg in s])
        elif not isinstance(s, unicode):
            # Note: We use .decode() here, instead of unicode(s, encoding,
            # errors), so that if s is a SafeString, it ends up being a
            # SafeUnicode at the end.
            s = s.decode(encoding, errors)
    except UnicodeDecodeError, e:
        raise UnicodeDecodeError(*e.args)
    return s


bleach-1.2.2/bleach/__init__.py

import logging
import re
import sys

import html5lib
from html5lib.sanitizer import HTMLSanitizer
from html5lib.serializer.htmlserializer import HTMLSerializer

from . import callbacks as linkify_callbacks
from .encoding import force_unicode
from .sanitizer import BleachSanitizer


VERSION = (1, 2, 2)
__version__ = '1.2.2'

__all__ = ['clean', 'linkify']

log = logging.getLogger('bleach')

ALLOWED_TAGS = [
    'a',
    'abbr',
    'acronym',
    'b',
    'blockquote',
    'code',
    'em',
    'i',
    'li',
    'ol',
    'strong',
    'ul',
]

ALLOWED_ATTRIBUTES = {
    'a': ['href', 'title'],
    'abbr': ['title'],
    'acronym': ['title'],
}

ALLOWED_STYLES = []

TLDS = """ac ad ae aero af ag ai al am an ao aq ar arpa as asia at au aw ax az
       ba bb bd be bf bg bh bi biz bj bm bn bo br bs bt bv bw by bz ca cat cc
       cd cf cg ch ci ck cl cm cn co com coop cr cu cv cx cy cz de dj dk dm
       do dz ec edu ee eg er es et eu fi fj fk fm fo fr ga gb gd ge gf gg gh
       gi gl gm gn gov gp gq gr gs gt gu gw gy hk hm hn hr ht hu id ie il im
       in info int io iq ir is it je jm jo jobs jp ke kg kh ki km kn kp kr
       kw ky kz la lb lc li lk lr ls lt lu lv ly ma mc md me mg mh mil mk ml
       mm mn mo mobi mp mq mr ms mt mu museum mv mw mx my mz na name nc ne
       net nf ng ni nl no np nr nu nz om org pa pe pf pg ph pk pl pm pn pr
       pro ps pt pw py qa re ro rs ru rw sa sb sc sd se sg sh si sj sk sl sm
       sn so sr st su sv sy sz tc td tel tf tg th tj tk tl tm tn to tp tr
       travel tt tv tw tz ua ug uk us uy uz va vc ve vg vi vn vu wf ws xn ye
       yt yu za zm zw""".split()

PROTOCOLS = HTMLSanitizer.acceptable_protocols

TLDS.reverse()

url_re = re.compile(
    r"""\(*  # Match any opening parentheses.
    \b(?<![@.])(?:(?:%s):/{0,3}(?:(?:\w+:)?\w+@)?)?  # http://
    ([\w-]+\.)+(?:%s)(?:\:\d+)?(?!\.\w)\b   # xx.yy.tld(:##)?
    (?:[/?][^\s\{\}\|\\\^\[\]`<>"]*)?
        # /path/zz (excluding "unsafe" chars from RFC 1738,
        # except for # and ~, which happen in practice)
    """ % (u'|'.join(PROTOCOLS), u'|'.join(TLDS)),
    re.IGNORECASE | re.VERBOSE | re.UNICODE)

proto_re = re.compile(r'^[\w-]+:/{0,3}', re.IGNORECASE)

punct_re = re.compile(r'([\.,]+)$')

email_re = re.compile(
    r"""(?<!//)
    (([-!#$%&'*+/=?^_`{}|~0-9A-Z]+
        (\.[-!#$%&'*+/=?^_`{}|~0-9A-Z]+)*  # dot-atom
    |^"([\001-\010\013\014\016-\037!#-\[\]-\177]
        |\\[\001-011\013\014\016-\177])*"  # quoted-string
    )@(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?)  # domain
    """,
    re.IGNORECASE | re.MULTILINE | re.VERBOSE)

NODE_TEXT = 4  # The numeric ID of a text node in simpletree.

DEFAULT_CALLBACKS = [linkify_callbacks.nofollow]

PY_26 = (sys.version_info < (2, 7))
RECURSION_EXCEPTION = RuntimeError if not PY_26 else AttributeError


def clean(text, tags=ALLOWED_TAGS, attributes=ALLOWED_ATTRIBUTES,
          styles=ALLOWED_STYLES, strip=False, strip_comments=True):
    """Clean an HTML fragment and return it."""
    if not text:
        return u''

    text = force_unicode(text)

    class s(BleachSanitizer):
        allowed_elements = tags
        allowed_attributes = attributes
        allowed_css_properties = styles
        strip_disallowed_elements = strip
        strip_html_comments = strip_comments

    parser = html5lib.HTMLParser(tokenizer=s)

    return _render(parser.parseFragment(text))


def linkify(text, callbacks=DEFAULT_CALLBACKS, skip_pre=False,
            parse_email=False, tokenizer=HTMLSanitizer):
    """Convert URL-like strings in an HTML fragment to links.

    linkify() converts strings that look like URLs or domain names in a
    blob of text that may be an HTML fragment to links, while preserving
    (a) links already in the string, (b) urls found in attributes, and
    (c) email addresses.
    """
    text = force_unicode(text)

    if not text:
        return u''

    parser = html5lib.HTMLParser(tokenizer=tokenizer)

    forest = parser.parseFragment(text)

    def replace_nodes(tree, new_frag, node):
        new_tree = parser.parseFragment(new_frag)
        for n in new_tree.childNodes:
            # Prevent us from re-parsing new links as existing links.
            if n.name == 'a':
                n._seen = True
            tree.insertBefore(n, node)
        tree.removeChild(node)
        # Return the number of nodes added, less the one removed.
        return len(new_tree.childNodes) - 1

    def strip_wrapping_parentheses(fragment):
        """Strips wrapping parentheses.

        Returns a tuple of the following format::

            (string stripped from wrapping parentheses,
             count of stripped opening parentheses,
             count of stripped closing parentheses)
        """
        opening_parentheses = closing_parentheses = 0
        # Count consecutive opening parentheses
        # at the beginning of the fragment (string).
        for char in fragment:
            if char == '(':
                opening_parentheses += 1
            else:
                break

        if opening_parentheses:
            newer_frag = ''
            # Cut the consecutive opening brackets from the fragment.
            fragment = fragment[opening_parentheses:]
            # Reverse the fragment for easier detection of parentheses
            # inside the URL.
            reverse_fragment = fragment[::-1]
            skip = False
            for char in reverse_fragment:
                # Remove the closing parentheses if it has a matching
                # opening parentheses (they are balanced).
                if (char == ')' and
                        closing_parentheses < opening_parentheses and
                        not skip):
                    closing_parentheses += 1
                    continue
                # Do not remove ')' from the URL itself.
                elif char != ')':
                    skip = True
                newer_frag += char
            fragment = newer_frag[::-1]

        return fragment, opening_parentheses, closing_parentheses

    def apply_callbacks(link, is_new):
        for cb in callbacks:
            link = cb(link, is_new)
            if link is None:
                return None
        return link

    def linkify_nodes(tree, parse_text=True):
        # I know this isn't Pythonic, but we're sometimes mutating
        # tree.childNodes, which ends up breaking the loop and causing us
        # to reparse code.
        children = len(tree.childNodes)
        current = 0  # A pointer to the "current" node.
        while current < children:
            node = tree.childNodes[current]
            if node.type == NODE_TEXT and parse_text:
                new_frag = _render(node)
                # Look for email addresses?
                if parse_email:
                    new_frag = re.sub(email_re, email_repl, new_frag)
                    if new_frag != _render(node):
                        adj = replace_nodes(tree, new_frag, node)
                        children += adj
                        current += adj
                        linkify_nodes(tree)
                        continue
                new_frag = re.sub(url_re, link_repl, new_frag)
                if new_frag != _render(node):
                    adj = replace_nodes(tree, new_frag, node)
                    children += adj
                    current += adj
            elif node.name == 'a' and not getattr(node, '_seen', False):
                if 'href' in node.attributes:
                    if callbacks:
                        _text = u''.join([c.toxml() for c
                                          in node.childNodes])
                        node.attributes['_text'] = _text
                        node.attributes = apply_callbacks(node.attributes,
                                                          False)
                        if node.attributes is None:
                            # A callback vetoed this link: delete the node
                            # but keep its rendered contents.
                            adj = replace_nodes(tree, _text, node)
                            children += adj
                            current += adj
                        else:
                            # Reparse the (possibly mangled) link text and
                            # replace the node's children with it.
                            text = node.attributes.pop('_text')
                            for n in reversed(node.childNodes):
                                node.removeChild(n)
                            new_tree = parser.parseFragment(text)
                            for n in new_tree.childNodes:
                                node.appendChild(n)
                            node._seen = True
            elif skip_pre and node.name == 'pre':
                linkify_nodes(node, False)
            elif not getattr(node, '_seen', False):
                linkify_nodes(node)
            current += 1

    def email_repl(match):
        addr = match.group(0).replace('"', '&quot;')
        link = {
            '_text': addr,
            'href': 'mailto:%s' % addr,
        }
        link = apply_callbacks(link, True)

        if link is None:
            return addr

        _href = link.pop('href')
        _text = link.pop('_text')

        repl = u'<a href="%s" %s>%s</a>'
        attribs = ' '.join('%s="%s"' % (k, v) for k, v in link.items())
        return repl % (_href, attribs, _text)

    def link_repl(match):
        url = match.group(0)
        open_brackets = close_brackets = 0
        if url.startswith('('):
            url, open_brackets, close_brackets = (
                strip_wrapping_parentheses(url)
            )
        end = u''
        m = re.search(punct_re, url)
        if m:
            end = m.group(0)
            url = url[0:m.start()]
        if re.search(proto_re, url):
            href = url
        else:
            href = u''.join([u'http://', url])

        link = {
            '_text': url,
            'href': href,
        }

        link = apply_callbacks(link, True)

        if link is None:
            return url

        _text = link.pop('_text')
        _href = link.pop('href')

        repl = u'%s<a href="%s" %s>%s</a>%s%s'
        attribs = ' '.join('%s="%s"' % (k, v) for k, v in link.items())

        return repl % ('(' * open_brackets, _href, attribs, _text, end,
                       ')' * close_brackets)

    try:
        linkify_nodes(forest)
    except (RECURSION_EXCEPTION), e:
        # If we hit the max recursion depth, just return what we've got.
        log.exception('Probable recursion error: %r' % e)

    return _render(forest)


def _render(tree):
    """Try rendering as HTML, then XML, then give up."""
    try:
        return force_unicode(_serialize(tree))
    except AssertionError:
        # The treewalker throws this sometimes.
        return force_unicode(tree.toxml())


def _serialize(domtree):
    walker = html5lib.treewalkers.getTreeWalker('simpletree')
    stream = walker(domtree)
    serializer = HTMLSerializer(quote_attr_values=True,
                                omit_optional_tags=False)
    return serializer.render(stream)
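A quick sketch of the tokenizer escape hatch introduced in 1.2 (the behavior
is pinned down by test_tokenizer in bleach/tests/test_links.py below):
linkify() sanitizes with html5lib's HTMLSanitizer by default, but accepts any
html5lib tokenizer class::

    >>> import bleach
    >>> from html5lib.tokenizer import HTMLTokenizer

    >>> bleach.linkify('test<x></x>')  # default: sanitizing tokenizer
    u'test&lt;x&gt;&lt;/x&gt;'

    >>> bleach.linkify('test<x></x>', tokenizer=HTMLTokenizer)
    u'test<x></x>'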
bleach-1.2.2/bleach/tests/__init__.py

bleach-1.2.2/bleach/tests/test_security.py

"""More advanced security tests"""
from nose.tools import eq_

from bleach import clean


def test_nested_script_tag():
    eq_('&lt;&lt;script&gt;script&gt;evil()&lt;&lt;/script&gt;/script&gt;',
        clean('<<script>script>evil()<</script>/script>'))
    eq_('&lt;&lt;x&gt;script&gt;evil()&lt;&lt;/x&gt;/script&gt;',
        clean('<<x>script>evil()<</x>/script>'))


def test_nested_script_tag_r():
    eq_('&lt;script&lt;script&gt;&gt;evil()&lt;/script&lt;&gt;&gt;',
        clean('<script<script>>evil()</script</script>>'))


def test_invalid_attr():
    IMG = ['img', ]
    IMG_ATTR = ['src']

    eq_('<a href="test">test</a>',
        clean('<a onclick="evil" href="test">test</a>'))
    eq_('<img src="test">',
        clean('<img onclick="evil" src="test">',
              tags=IMG, attributes=IMG_ATTR))
    eq_('<img src="test">',
        clean('<img href="invalid" src="test">',
              tags=IMG, attributes=IMG_ATTR))


def test_unquoted_attr():
    eq_('<abbr title="mytitle">myabbr</abbr>',
        clean('<abbr title=mytitle>myabbr</abbr>'))


def test_unquoted_event_handler():
    eq_('<a href="http://xx.com">xx.com</a>',
        clean('<a href=http://xx.com onclick=foo()>xx.com</a>'))


def test_invalid_attr_value():
    eq_('&lt;img src="javascript:alert(\'XSS\');"&gt;',
        clean('<img src="javascript:alert(\'XSS\');">'))


def test_invalid_href_attr():
    eq_('<a>xss</a>',
        clean('<a href="javascript:alert(\'XSS\')">xss</a>'))


def test_invalid_filter_attr():
    IMG = ['img', ]
    IMG_ATTR = {'img': lambda n, v: n == 'src' and v == "http://example.com/"}

    eq_('<img src="http://example.com/">',
        clean('<img onclick="evil" src="http://example.com/">',
              tags=IMG, attributes=IMG_ATTR))

    eq_('<img>',
        clean('<img onclick="evil" src="http://badhost.com/">',
              tags=IMG, attributes=IMG_ATTR))


def test_invalid_tag_char():
    eq_('&lt;script xss="" src="http://xx.com/xss.js"&gt;&lt;/script&gt;',
        clean('<script/xss src="http://xx.com/xss.js"></script>'))
    eq_('&lt;script src="http://xx.com/xss.js"&gt;&lt;/script&gt;',
        clean('<script/src="http://xx.com/xss.js"></script>'))


def test_unclosed_tag():
    eq_('&lt;script src="http://xx.com/xss.js&amp;lt;b"&gt;',
        clean('<script src=http://xx.com/xss.js<b>'))


def test_strip():
    """Using strip=True shouldn't result in malicious content."""
    s = '<scri<script>pt>alert(1)</scr</script>ipt>'
    eq_('pt&gt;alert(1)ipt&gt;', clean(s, strip=True))
    s = '<scri<scri<script>pt>pt>alert(1)</script>'
    eq_('pt&gt;pt&gt;alert(1)', clean(s, strip=True))


def test_nasty():
    """Nested, broken up, multiple tags, are still foiled!"""
    test = ('<scr<script></script>ipt type="text/javascript"'
            '>alert("foo");</script>script<del></del>'
            '>')
    expect = (u'&lt;scr&lt;script&gt;&lt;/script&gt;ipt type='
              u'"text/javascript"&gt;alert("foo");&lt;/script&gt;script'
              u'&lt;del&gt;&lt;/del&gt;&gt;')
    eq_(expect, clean(test))


def test_poster_attribute():
    """Poster attributes should not allow javascript."""
    tags = ['video']
    attrs = {'video': ['poster']}
    test = '<video poster="javascript:alert(1)"></video>'
    expect = '<video></video>'
    eq_(expect, clean(test, tags=tags, attributes=attrs))
    ok = '<video poster="/foo.png"></video>'
    eq_(ok, clean(ok, tags=tags, attributes=attrs))


def test_feed_protocol():
    eq_('<a>foo</a>', clean('<a href="feed:file:///tmp/foo">foo</a>'))


bleach-1.2.2/bleach/tests/test_unicode.py

# -*- coding: utf-8 -*-

from nose.tools import eq_

from bleach import clean, linkify


def test_japanese_safe_simple():
    eq_(u'ヘルプとチュートリアル', clean(u'ヘルプとチュートリアル'))
    eq_(u'ヘルプとチュートリアル', linkify(u'ヘルプとチュートリアル'))


def test_japanese_strip():
    eq_(u'<em>ヘルプとチュートリアル</em>',
        clean(u'<em>ヘルプとチュートリアル</em>'))
    eq_(u'&lt;span&gt;ヘルプとチュートリアル&lt;/span&gt;',
        clean(u'<span>ヘルプとチュートリアル</span>'))


def test_russian_simple():
    eq_(u'Домашняя', clean(u'Домашняя'))
    eq_(u'Домашняя', linkify(u'Домашняя'))


def test_mixed():
    eq_(u'Домашняяヘルプとチュートリアル',
        clean(u'Домашняяヘルプとチュートリアル'))


def test_mixed_linkify():
    eq_(u'Домашняя <a href="http://example.com" rel="nofollow">'
        u'http://example.com</a> ヘルプとチュートリアル',
        linkify(u'Домашняя http://example.com ヘルプとチュートリアル'))


def test_url_utf8():
    """Allow UTF8 characters in URLs themselves."""
    out = u'<a href="%(url)s" rel="nofollow">%(url)s</a>'

    tests = (
        ('http://éxámplé.com/', out % {'url': u'http://éxámplé.com/'}),
        ('http://éxámplé.com/íàñá/',
         out % {'url': u'http://éxámplé.com/íàñá/'}),
        ('http://éxámplé.com/íàñá/?foo=bar',
         out % {'url': u'http://éxámplé.com/íàñá/?foo=bar'}),
        ('http://éxámplé.com/íàñá/?fóo=bár',
         out % {'url': u'http://éxámplé.com/íàñá/?fóo=bár'}),
    )

    def check(test, expected_output):
        eq_(expected_output, linkify(test))

    for test, expected_output in tests:
        yield check, test, expected_output


bleach-1.2.2/bleach/tests/test_basics.py

import html5lib
from nose.tools import eq_

import bleach


def test_empty():
    eq_('', bleach.clean(''))


def test_nbsp():
    eq_(u'\xa0test string\xa0', bleach.clean('&nbsp;test string&nbsp;'))


def test_comments_only():
    comment = '<!-- this is a comment -->'
    open_comment = '<!-- this is an open comment'
    eq_('', bleach.clean(comment))
    eq_('', bleach.clean(open_comment))
    eq_(comment, bleach.clean(comment, strip_comments=False))
    eq_('%s-->' % open_comment,
        bleach.clean(open_comment, strip_comments=False))


def test_with_comments():
    html = '<!-- comment -->Just text'
    eq_('Just text', bleach.clean(html))
    eq_(html, bleach.clean(html, strip_comments=False))


def test_no_html():
    eq_('no html string', bleach.clean('no html string'))


def test_allowed_html():
    eq_('an <strong>allowed</strong> tag',
        bleach.clean('an <strong>allowed</strong> tag'))
    eq_('another <em>good</em> tag',
        bleach.clean('another <em>good</em> tag'))


def test_bad_html():
    eq_('a <em>fixed tag</em>',
        bleach.clean('a <em>fixed tag'))


def test_function_arguments():
    TAGS = ['span', 'br']
    ATTRS = {'span': ['style']}
    eq_('a <br><span style="">test</span>',
        bleach.clean('a <br/><span style="color:red">test</span>',
                     tags=TAGS, attributes=ATTRS))


def test_named_arguments():
    ATTRS = {'a': ['rel', 'href']}
    s = u'<a href="http://xx.com" rel="alternate">xx.com</a>'
    eq_('<a href="http://xx.com">xx.com</a>', bleach.clean(s))
    eq_(s, bleach.clean(s, attributes=ATTRS))


def test_disallowed_html():
    eq_('a &lt;script&gt;safe()&lt;/script&gt; test',
        bleach.clean('a <script>safe()</script> test'))
    eq_('a &lt;style&gt;body{}&lt;/style&gt; test',
        bleach.clean('a <style>body{}</style> test'))


def test_bad_href():
    eq_('<em>no link</em>',
        bleach.clean('<em href="fail">no link</em>'))


def test_bare_entities():
    eq_('an &amp; entity', bleach.clean('an & entity'))
    eq_('an &lt; entity', bleach.clean('an < entity'))
    eq_('tag &lt; <em>and</em> entity',
        bleach.clean('tag < <em>and</em> entity'))
    eq_('&amp;', bleach.clean('&amp;'))


def test_escaped_entities():
    s = u'&lt;em&gt;strong&lt;/em&gt;'
    eq_(s, bleach.clean(s))


def test_serializer():
    s = u'<table></table>'
    eq_(s, bleach.clean(s, tags=['table']))
    eq_(u'test<table></table>', bleach.linkify(u'<table>test</table>'))
    eq_(u'<p>test</p>', bleach.clean(u'<p>test</p>', tags=['p']))


def test_no_href_links():
    s = u'<a name="anchor">x</a>'
    eq_(s, bleach.linkify(s))


def test_weird_strings():
    s = '</3'
    eq_(bleach.clean(s), '')


def test_stripping():
    eq_('a test <em>with</em> <b>html</b> tags',
        bleach.clean('a test <em>with</em> <b>html</b> tags', strip=True))
    eq_('a test with html tags',
        bleach.clean('a test with '
                     '<blink>html tags</blink>', strip=True))

    s = '<p><a href="http://example.com/">link text</a></p>'
    eq_('<p>link text</p>', bleach.clean(s, tags=['p'], strip=True))
    s = '<p><span>multiply <span>nested <span>text</span></span></span></p>'
    eq_('<p>multiply nested text</p>',
        bleach.clean(s, tags=['p'], strip=True))

    s = ('<p><a href="http://example.com/"><img src="http://example.com/">'
         '</a></p>')
    eq_('<p><a href="http://example.com/"></a></p>',
        bleach.clean(s, tags=['p', 'a'], strip=True))


def test_allowed_styles():
    ATTR = ['style']
    STYLE = ['color']
    blank = '<b style=""></b>'
    s = '<b style="color: blue;"></b>'
    eq_(blank, bleach.clean('<b style="color: blue;"></b>', attributes=ATTR))
    eq_(s, bleach.clean(s, attributes=ATTR, styles=STYLE))
    eq_(s, bleach.clean('<b style="color: blue; font-size: 20px;"></b>',
                        attributes=ATTR, styles=STYLE))


def test_idempotent():
    """Make sure that applying the filter twice doesn't change anything."""
    dirty = u'<span>invalid & </span> < extra http://link.com<em>'

    clean = bleach.clean(dirty)
    eq_(clean, bleach.clean(clean))

    linked = bleach.linkify(dirty)
    eq_(linked, bleach.linkify(linked))


def test_lowercase_html():
    """We should output lowercase HTML."""
    dirty = u'<EM CLASS="FOO">BAR</EM>'
    clean = u'<em class="FOO">BAR</em>'
    eq_(clean, bleach.clean(dirty, attributes=['class']))


def test_wildcard_attributes():
    ATTR = {
        '*': ['id'],
        'img': ['src'],
    }
    TAG = ['img', 'em']
    dirty = (u'both <em id="foo" style="color: black">can</em> have '
             u'<img id="bar" src="foo"/>')
    clean = u'both <em id="foo">can</em> have <img id="bar" src="foo">'
    eq_(clean, bleach.clean(dirty, tags=TAG, attributes=ATTR))


def test_sarcasm():
    """Jokes should crash.<sarcasm/>"""
    dirty = u'Yeah right <sarcasm/>'
    clean = u'Yeah right &lt;sarcasm/&gt;'
    eq_(clean, bleach.clean(dirty))


bleach-1.2.2/bleach/tests/test_links.py

import urllib

from html5lib.tokenizer import HTMLTokenizer
from nose.tools import eq_

from bleach import linkify, url_re, DEFAULT_CALLBACKS as DC


def test_url_re():
    def no_match(s):
        match = url_re.search(s)
        if match:
            assert not match, 'matched %s' % s[slice(*match.span())]

    yield no_match, 'just what i am looking for...it'


def test_empty():
    eq_('', linkify(''))


def test_simple_link():
    eq_('a <a href="http://example.com" rel="nofollow">http://example.com'
        '</a> link',
        linkify('a http://example.com link'))
    eq_('a <a href="https://example.com" rel="nofollow">https://example.com'
        '</a> link',
        linkify('a https://example.com link'))
    eq_('an <a href="http://example.com" rel="nofollow">example.com</a> link',
        linkify('an example.com link'))


def test_trailing_slash():
    eq_('<a href="http://example.com/" rel="nofollow">http://example.com/</a>',
        linkify('http://example.com/'))
    eq_('<a href="http://example.com/foo/" rel="nofollow">'
        'http://example.com/foo/</a>',
        linkify('http://example.com/foo/'))
    eq_('<a href="http://example.com/foo/bar/" rel="nofollow">'
        'http://example.com/foo/bar/</a>',
        linkify('http://example.com/foo/bar/'))


def test_mangle_link():
    """We can muck with the href attribute of the link."""
    def filter_url(attrs, new=False):
        attrs['href'] = (u'http://bouncer/?u=%s' %
                         urllib.quote_plus(attrs['href']))
        return attrs

    eq_('<a href="http://bouncer/?u=http%3A%2F%2Fexample.com" rel="nofollow">'
        'http://example.com</a>',
        linkify('http://example.com', DC + [filter_url]))


def test_mangle_text():
    """We can muck with the inner text of a link."""

    def ft(attrs, new=False):
        attrs['_text'] = 'bar'
        return attrs

    eq_('<a href="http://ex.mp">bar</a> <a href="http://ex.mp/foo">bar</a>',
        linkify('http://ex.mp <a href="http://ex.mp/foo">foo</a>', [ft]))


def test_email_link():
    tests = (
        ('a james@example.com mailto', False, 'a james@example.com mailto'),
        ('a james@example.com.au mailto', False,
         'a james@example.com.au mailto'),
        ('a <a href="mailto:james@example.com">james@example.com</a> mailto',
         True, 'a james@example.com mailto'),
        ('aussie <a href="mailto:james@example.com.au">'
         'james@example.com.au</a> mailto', True,
         'aussie james@example.com.au mailto'),
        # This is kind of a pathological case. I guess we do our best here.
        ('email to <a href="james@example.com" rel="nofollow">'
         'james@example.com</a>', True,
         'email to <a href="james@example.com">james@example.com</a>'),
    )

    def _check(o, p, i):
        eq_(o, linkify(i, parse_email=p))

    for (o, p, i) in tests:
        yield _check, o, p, i


def test_email_link_escaping():
    tests = (
        ('''<a href='mailto:"james"@example.com'>'''
         '''"james"@example.com</a>''',
         '"james"@example.com'),
        ('''<a href="mailto:&quot;j'ames&quot;@example.com">'''
         '''"j'ames"@example.com</a>''',
         '"j\'ames"@example.com'),
        ('''<a href='mailto:"ja>mes"@example.com'>'''
         '''"ja&gt;mes"@example.com</a>''',
         '"ja>mes"@example.com'),
    )

    def _check(o, i):
        eq_(o, linkify(i, parse_email=True))

    for (o, i) in tests:
        yield _check, o, i


def test_prevent_links():
    """Returning None from any callback should remove links or prevent them
    from being created."""

    def no_new_links(attrs, new=False):
        if new:
            return None
        return attrs

    def no_old_links(attrs, new=False):
        if not new:
            return None
        return attrs

    def noop(attrs, new=False):
        return attrs

    in_text = 'a ex.mp <a href="http://example.com">example</a>'
    out_text = 'a <a href="http://ex.mp">ex.mp</a> example'
    tests = (
        ([noop], ('a <a href="http://ex.mp">ex.mp</a> '
                  '<a href="http://example.com">example</a>'), 'noop'),
        ([no_new_links, noop], in_text, 'no new, noop'),
        ([noop, no_new_links], in_text, 'noop, no new'),
        ([no_old_links, noop], out_text, 'no old, noop'),
        ([noop, no_old_links], out_text, 'noop, no old'),
        ([no_old_links, no_new_links], 'a ex.mp example', 'no links'),
    )

    def _check(cb, o, msg):
        eq_(o, linkify(in_text, cb), msg)

    for (cb, o, msg) in tests:
        yield _check, cb, o, msg


def test_set_attrs():
    """We can set random attributes on links."""

    def set_attr(attrs, new=False):
        attrs['rev'] = 'canonical'
        return attrs

    eq_('<a href="http://ex.mp" rev="canonical">ex.mp</a>',
        linkify('ex.mp', [set_attr]))


def test_only_proto_links():
    """Only create links if there's a protocol."""
    def only_proto(attrs, new=False):
        if new and not attrs['_text'].startswith(('http:', 'https:')):
            return None
        return attrs

    in_text = 'a ex.mp http://ex.mp bar'
    out_text = ('a ex.mp <a href="http://ex.mp">http://ex.mp</a> '
                'bar')
    eq_(out_text, linkify(in_text, [only_proto]))


def test_stop_email():
    """Returning None should prevent a link from being created."""
    def no_email(attrs, new=False):
        if attrs['href'].startswith('mailto:'):
            return None
        return attrs
    text = 'do not link james@example.com'
    eq_(text, linkify(text, parse_email=True, callbacks=[no_email]))


def test_tlds():
    eq_('<a href="http://example.com" rel="nofollow">example.com</a>',
        linkify('example.com'))
    eq_('<a href="http://example.co.uk" rel="nofollow">example.co.uk</a>',
        linkify('example.co.uk'))
    eq_('<a href="http://example.edu" rel="nofollow">example.edu</a>',
        linkify('example.edu'))
    eq_('example.xxx', linkify('example.xxx'))
    eq_(' brie', linkify(' brie'))
    eq_('<a href="http://bit.ly/fun" rel="nofollow">bit.ly/fun</a>',
        linkify('bit.ly/fun'))


def test_escaping():
    eq_('&lt; unrelated', linkify('< unrelated'))


def test_nofollow_off():
    eq_('<a href="http://example.com">example.com</a>',
        linkify(u'example.com', []))


def test_link_in_html():
    eq_('<i><a href="http://yy.com" rel="nofollow">http://yy.com</a></i>',
        linkify('<i>http://yy.com</i>'))
    eq_('<em><strong><a href="http://xx.com" rel="nofollow">http://xx.com'
        '</a></strong></em>',
        linkify('<em><strong>http://xx.com</strong></em>'))


def test_links_https():
    eq_('<a href="https://yy.com" rel="nofollow">https://yy.com</a>',
        linkify('https://yy.com'))


def test_add_rel_nofollow():
    """Verify that rel="nofollow" is added to an existing link"""
    eq_('<a href="http://yy.com" rel="nofollow">http://yy.com</a>',
        linkify('<a href="http://yy.com">http://yy.com</a>'))


def test_url_with_path():
    eq_('<a href="http://example.com/path/to/file" rel="nofollow">'
        'http://example.com/path/to/file</a>',
        linkify('http://example.com/path/to/file'))


def test_link_ftp():
    eq_('<a href="ftp://ftp.mozilla.org/some/file" rel="nofollow">'
        'ftp://ftp.mozilla.org/some/file</a>',
        linkify('ftp://ftp.mozilla.org/some/file'))


def test_link_query():
    eq_('<a href="http://xx.com/?test=win" rel="nofollow">'
        'http://xx.com/?test=win</a>',
        linkify('http://xx.com/?test=win'))
    eq_('<a href="http://xx.com/?test=win" rel="nofollow">'
        'xx.com/?test=win</a>',
        linkify('xx.com/?test=win'))
    eq_('<a href="http://xx.com?test=win" rel="nofollow">'
        'xx.com?test=win</a>',
        linkify('xx.com?test=win'))


def test_link_fragment():
    eq_('<a href="http://xx.com/path#frag" rel="nofollow">'
        'http://xx.com/path#frag</a>',
        linkify('http://xx.com/path#frag'))


def test_link_entities():
    eq_('<a href="http://xx.com/?a=1&amp;b=2" rel="nofollow">'
        'http://xx.com/?a=1&amp;b=2</a>',
        linkify('http://xx.com/?a=1&b=2'))


def test_escaped_html():
    """If I pass in escaped HTML, it should probably come out escaped."""
    s = '&lt;em&gt;strong&lt;/em&gt;'
    eq_(s, linkify(s))


def test_link_http_complete():
    eq_('<a href="https://user:pass@ftp.mozilla.org/x/y.exe?a=b&amp;c=d&amp;e#f"'
        ' rel="nofollow">https://user:pass@ftp.mozilla.org/x/y.exe?a=b&amp;c=d'
        '&amp;e#f</a>',
        linkify('https://user:pass@ftp.mozilla.org/x/y.exe?a=b&c=d&e#f'))


def test_non_url():
    """document.vulnerable should absolutely not be linkified."""
    s = 'document.vulnerable'
    eq_(s, linkify(s))


def test_javascript_url():
    """javascript: urls should never be linkified."""
    s = 'javascript:document.vulnerable'
    eq_(s, linkify(s))


def test_unsafe_url():
    """Any unsafe char ({}[]<>, etc.) in the path should end URL scanning."""
    eq_('All your{"<a href="http://xx.yy.com/grover.png" rel="nofollow">'
        'xx.yy.com/grover.png</a>"}base are',
        linkify('All your{"xx.yy.com/grover.png"}base are'))


def test_skip_pre():
    """Skip linkification in <pre> tags."""
    simple = 'http://xx.com <pre>http://xx.com</pre>'
    linked = ('<a href="http://xx.com" rel="nofollow">http://xx.com</a> '
              '<pre>http://xx.com</pre>')
    all_linked = ('<a href="http://xx.com" rel="nofollow">http://xx.com</a> '
                  '<pre><a href="http://xx.com" rel="nofollow">http://xx.com'
                  '</a></pre>')
    eq_(linked, linkify(simple, skip_pre=True))
    eq_(all_linked, linkify(simple))

    already_linked = '<pre><a href="http://xx.com">xx</a></pre>'
    nofollowed = '<pre><a href="http://xx.com" rel="nofollow">xx</a></pre>'
    eq_(nofollowed, linkify(already_linked))
    eq_(nofollowed, linkify(already_linked, skip_pre=True))


def test_libgl():
    """libgl.so.1 should not be linkified."""
    eq_('libgl.so.1', linkify('libgl.so.1'))


def test_end_of_sentence():
    """example.com. should match."""
    out = u'<a href="http://%s" rel="nofollow">%s</a>%s'
    in_ = u'%s%s'

    def check(u, p):
        eq_(out % (u, u, p), linkify(in_ % (u, p)))

    tests = (
        ('example.com', '.'),
        ('example.com', '...'),
        ('ex.com/foo', '.'),
        ('ex.com/foo', '....'),
    )

    for u, p in tests:
        yield check, u, p


def test_end_of_clause():
    """example.com/foo, shouldn't include the ,"""
    eq_('<a href="http://ex.com/foo" rel="nofollow">ex.com/foo</a>, bar',
        linkify('ex.com/foo, bar'))


def test_sarcasm():
    """Jokes should crash.<sarcasm/>"""
    dirty = u'Yeah right <sarcasm/>'
    clean = u'Yeah right &lt;sarcasm/&gt;'
    eq_(clean, linkify(dirty))


def test_wrapping_parentheses():
    """URLs wrapped in parentheses should not include them."""
    out = u'%s<a href="http://%s" rel="nofollow">%s</a>%s'

    tests = (
        ('(example.com)', out % ('(', 'example.com', 'example.com', ')')),
        ('(example.com/)', out % ('(', 'example.com/', 'example.com/', ')')),
        ('(example.com/foo)', out % ('(', 'example.com/foo',
                                     'example.com/foo', ')')),
        ('(((example.com/))))', out % ('(((', 'example.com/)',
                                       'example.com/)', ')))')),
        ('example.com/))', out % ('', 'example.com/))',
                                  'example.com/))', '')),
        ('http://en.wikipedia.org/wiki/Test_(assessment)',
         out % ('', 'en.wikipedia.org/wiki/Test_(assessment)',
                'http://en.wikipedia.org/wiki/Test_(assessment)', '')),
        ('(http://en.wikipedia.org/wiki/Test_(assessment))',
         out % ('(', 'en.wikipedia.org/wiki/Test_(assessment)',
                'http://en.wikipedia.org/wiki/Test_(assessment)', ')')),
        ('((http://en.wikipedia.org/wiki/Test_(assessment))',
         out % ('((', 'en.wikipedia.org/wiki/Test_(assessment',
                'http://en.wikipedia.org/wiki/Test_(assessment', '))')),
        ('(http://en.wikipedia.org/wiki/Test_(assessment)))',
         out % ('(', 'en.wikipedia.org/wiki/Test_(assessment))',
                'http://en.wikipedia.org/wiki/Test_(assessment))', ')')),
        ('(http://en.wikipedia.org/wiki/)Test_(assessment',
         out % ('(', 'en.wikipedia.org/wiki/)Test_(assessment',
                'http://en.wikipedia.org/wiki/)Test_(assessment', '')),
    )

    def check(test, expected_output):
        eq_(expected_output, linkify(test))

    for test, expected_output in tests:
        yield check, test, expected_output


def test_ports():
    """URLs can contain port numbers."""
    tests = (
        ('http://foo.com:8000', ('http://foo.com:8000', '')),
        ('http://foo.com:8000/', ('http://foo.com:8000/', '')),
        ('http://bar.com:xkcd', ('http://bar.com', ':xkcd')),
        ('http://foo.com:81/bar', ('http://foo.com:81/bar', '')),
        ('http://foo.com:', ('http://foo.com', ':')),
    )

    def check(test, output):
        eq_(u'<a href="{0}" rel="nofollow">{0}</a>{1}'.format(*output),
            linkify(test))

    for test, output in tests:
        yield check, test, output


def test_tokenizer():
    """Linkify doesn't always have to sanitize."""
    raw = 'test<x></x>'
    eq_('test&lt;x&gt;&lt;/x&gt;', linkify(raw))
    eq_(raw, linkify(raw, tokenizer=HTMLTokenizer))


def test_ignore_bad_protocols():
    eq_('foohttp://bar', linkify('foohttp://bar'))
    eq_('foohttp://<a href="http://exampl.com" rel="nofollow">exampl.com</a>',
        linkify('foohttp://exampl.com'))


def test_max_recursion_depth():
    """If we hit the max recursion depth, just return the string."""
    test = '<em>' * 2000 + 'foo' + '</em>' * 2000
    eq_(test, linkify(test))


def test_link_emails_and_urls():
    """parse_email=True shouldn't prevent URLs from getting linkified."""
    output = ('<a href="http://example.com" rel="nofollow">'
              'http://example.com</a> <a href="mailto:person@example.com">'
              'person@example.com</a>')
    eq_(output, linkify('http://example.com person@example.com',
                        parse_email=True))


def test_links_case_insensitive():
    """Protocols and domain names are case insensitive."""
    expect = ('<a href="HTTP://EXAMPLE.COM" rel="nofollow">'
              'HTTP://EXAMPLE.COM</a>')
    eq_(expect, linkify('HTTP://EXAMPLE.COM'))


def test_elements_inside_links():
    eq_(u'<a href="#" rel="nofollow">hello<br></a>',
        linkify('<a href="#">hello<br></a>'))

    eq_(u'<a href="#" rel="nofollow"><strong>bold</strong> hello<br></a>',
        linkify('<a href="#"><strong>bold</strong> hello<br></a>'))


bleach-1.2.2/bleach/tests/test_css.py

from functools import partial

from nose.tools import eq_

from bleach import clean

clean = partial(clean, tags=['p'], attributes=['style'])


def test_allowed_css():
    tests = (
        ('font-family: Arial; color: red; float: left; '
         'background-color: red;', 'color: red;', ['color']),
        ('border: 1px solid blue; color: red; float: left;',
         'color: red;', ['color']),
        ('border: 1px solid blue; color: red; float: left;',
         'color: red; float: left;', ['color', 'float']),
        ('color: red; float: left; padding: 1em;',
         'color: red; float: left;', ['color', 'float']),
        ('color: red; float: left; padding: 1em;', 'color: red;', ['color']),
        ('cursor: -moz-grab;', 'cursor: -moz-grab;', ['cursor']),
        ('color: hsl(30,100%,50%);', 'color: hsl(30,100%,50%);', ['color']),
        ('color: rgba(255,0,0,0.4);',
         'color: rgba(255,0,0,0.4);', ['color']),
        ("text-overflow: ',' ellipsis;",
         "text-overflow: ',' ellipsis;", ['text-overflow']),
        ('text-overflow: "," ellipsis;',
         'text-overflow: "," ellipsis;', ['text-overflow']),
        ('font-family: "Arial";',
         'font-family: "Arial";', ['font-family']),
    )

    p_single = '<p style=\'%s\'>bar</p>'
    p_double = "<p style=\"%s\">bar</p>"

    def check(i, o, s):
        if '"' in i:
            eq_(p_double % o, clean(p_double % i, styles=s))
        else:
            eq_(p_single % o, clean(p_single % i, styles=s))

    for i, o, s in tests:
        yield check, i, o, s


def test_valid_css():
    """The sanitizer should fix missing CSS values."""
    styles = ['color', 'float']
    eq_('<p style="float: left;">foo</p>',
        clean('<p style="float: left; color: float: left;">foo</p>',
              styles=styles))
    eq_('<p style="">foo</p>',
        clean('<p style="color: float: left;">foo</p>', styles=styles))


def test_style_hang():
    """The sanitizer should not hang on any inline styles"""
    # TODO: Neaten this up. It's copypasta from MDN/Kuma to repro the bug
    style = ("""margin-top: 0px; margin-right: 0px; margin-bottom: 1.286em; """
             """margin-left: 0px; padding-top: 15px; padding-right: 15px; """
             """padding-bottom: 15px; padding-left: 15px; border-top-width: """
             """1px; border-right-width: 1px; border-bottom-width: 1px; """
             """border-left-width: 1px; border-top-style: dotted; """
             """border-right-style: dotted; border-bottom-style: dotted; """
             """border-left-style: dotted; border-top-color: rgb(203, 200, """
             """185); border-right-color: rgb(203, 200, 185); """
             """border-bottom-color: rgb(203, 200, 185); border-left-color: """
             """rgb(203, 200, 185); background-image: initial; """
             """background-attachment: initial; background-origin: initial; """
             """background-clip: initial; background-color: """
             """rgb(246, 246, 242); overflow-x: auto; overflow-y: auto; """
             """font: normal normal normal 100%/normal 'Courier New', """
             """'Andale Mono', monospace; background-position: initial """
             """initial; background-repeat: initial initial;""")
    html = '<p style="%s">Hello world</p>' % style
    styles = [
        'border', 'float', 'overflow', 'min-height', 'vertical-align',
        'white-space',
        'margin', 'margin-left', 'margin-top', 'margin-bottom',
        'margin-right',
        'padding', 'padding-left', 'padding-top', 'padding-bottom',
        'padding-right',
        'background',
        'background-color',
        'font', 'font-size', 'font-weight', 'text-align', 'text-transform',
    ]

    expected = ("""<p style="margin-top: 0px; margin-right: 0px; """
                """margin-bottom: 1.286em; margin-left: 0px; """
                """padding-top: 15px; padding-right: 15px; """
                """padding-bottom: 15px; padding-left: 15px; """
                """background-color: rgb(246, 246, 242); """
                """font: normal normal normal 100%/normal 'Courier New', """
                """'Andale Mono', monospace;">Hello world</p>""")

    result = clean(html, styles=styles)
    eq_(expected, result)


bleach-1.2.2/bleach/callbacks.py

"""A set of basic callbacks for bleach.linkify."""


def nofollow(attrs, new=False):
    if attrs['href'].startswith('mailto:'):
        return attrs
    attrs['rel'] = 'nofollow'
    return attrs


def target_blank(attrs, new=False):
    if attrs['href'].startswith('mailto:'):
        return attrs
    attrs['target'] = '_blank'
    return attrs
bleach-1.2.2/bleach/sanitizer.py

import re
from xml.sax.saxutils import escape, unescape

from html5lib.constants import tokenTypes
from html5lib.sanitizer import HTMLSanitizerMixin
from html5lib.tokenizer import HTMLTokenizer


PROTOS = HTMLSanitizerMixin.acceptable_protocols
PROTOS.remove('feed')


class BleachSanitizerMixin(HTMLSanitizerMixin):
    """Mixin to replace sanitize_token() and sanitize_css()."""

    allowed_svg_properties = []
    # TODO: When the next html5lib version comes out, nuke this.
    attr_val_is_uri = HTMLSanitizerMixin.attr_val_is_uri + ['poster']

    def sanitize_token(self, token):
        """Sanitize a token either by HTML-encoding or dropping.

        Unlike HTMLSanitizerMixin.sanitize_token, allowed_attributes can be
        a dict of {'tag': ['attribute', 'pairs'], 'tag': callable}.

        Here callable is a function with two arguments of attribute name
        and value. It should return True or False.

        Also gives the option to strip tags instead of encoding.
        """
        if (getattr(self, 'wildcard_attributes', None) is None and
                isinstance(self.allowed_attributes, dict)):
            self.wildcard_attributes = self.allowed_attributes.get('*', [])

        if token['type'] in (tokenTypes['StartTag'], tokenTypes['EndTag'],
                             tokenTypes['EmptyTag']):
            if token['name'] in self.allowed_elements:
                if 'data' in token:
                    if isinstance(self.allowed_attributes, dict):
                        allowed_attributes = self.allowed_attributes.get(
                            token['name'], [])
                        if not callable(allowed_attributes):
                            allowed_attributes += self.wildcard_attributes
                    else:
                        allowed_attributes = self.allowed_attributes
                    attrs = dict([(name, val) for name, val
                                  in token['data'][::-1]
                                  if (allowed_attributes(name, val)
                                      if callable(allowed_attributes)
                                      else name in allowed_attributes)])
                    for attr in self.attr_val_is_uri:
                        if not attr in attrs:
                            continue
                        val_unescaped = re.sub("[`\000-\040\177-\240\s]+",
                                               '',
                                               unescape(attrs[attr])).lower()
                        # Remove replacement characters from unescaped
                        # characters.
                        val_unescaped = val_unescaped.replace(u"\ufffd", "")
                        if (re.match(r'^[a-z0-9][-+.a-z0-9]*:',
                                     val_unescaped) and
                            (val_unescaped.split(':')[0]
                             not in self.allowed_protocols)):
                            del attrs[attr]
                    for attr in self.svg_attr_val_allows_ref:
                        if attr in attrs:
                            attrs[attr] = re.sub(r'url\s*\(\s*[^#\s][^)]+?\)',
                                                 ' ',
                                                 unescape(attrs[attr]))
                    if (token['name'] in self.svg_allow_local_href and
                            'xlink:href' in attrs and
                            re.search(r'^\s*[^#\s].*',
                                      attrs['xlink:href'])):
                        del attrs['xlink:href']
                    if 'style' in attrs:
                        attrs['style'] = self.sanitize_css(attrs['style'])
                    token['data'] = [(name, val)
                                     for name, val in attrs.items()]
                return token
            elif self.strip_disallowed_elements:
                pass
            else:
                if token['type'] == tokenTypes['EndTag']:
                    token['data'] = '</%s>' % token['name']
                elif token['data']:
                    attrs = ''.join([' %s="%s"' % (k, escape(v))
                                     for k, v in token['data']])
                    token['data'] = '<%s%s>' % (token['name'], attrs)
                else:
                    token['data'] = '<%s>' % token['name']
                if token['selfClosing']:
                    token['data'] = token['data'][:-1] + '/>'
                token['type'] = tokenTypes['Characters']
                del token["name"]
                return token
        elif token['type'] == tokenTypes['Comment']:
            if not self.strip_html_comments:
                return token
        else:
            return token

    def sanitize_css(self, style):
        """HTMLSanitizerMixin.sanitize_css replacement.

        HTMLSanitizerMixin.sanitize_css always whitelists background-*,
        border-*, margin-*, and padding-*. We only whitelist what's in the
        whitelist.
        """
        # disallow urls
        style = re.compile('url\s*\(\s*[^\s)]+?\s*\)\s*').sub(' ', style)

        # gauntlet
        # TODO: Make sure this does what it's meant to - I *think* it wants
        # to validate style attribute contents.
        parts = style.split(';')
        gauntlet = re.compile("""^([-/:,#%.'"\sa-zA-Z0-9!]|\w-\w|'[\s\w]+'\s*"""
                              """|"[\s\w]+"|\([\d,%\.\s]+\))*$""")
        for part in parts:
            if not gauntlet.match(part):
                return ''

        if not re.match("^\s*([-\w]+\s*:[^:;]*(;\s*|$))*$", style):
            return ''

        clean = []
        for prop, value in re.findall('([-\w]+)\s*:\s*([^:;]*)', style):
            if not value:
                continue
            if prop.lower() in self.allowed_css_properties:
                clean.append(prop + ': ' + value + ';')
            elif prop.lower() in self.allowed_svg_properties:
                clean.append(prop + ': ' + value + ';')

        return ' '.join(clean)


class BleachSanitizer(HTMLTokenizer, BleachSanitizerMixin):
    def __init__(self, stream, encoding=None, parseMeta=True,
                 useChardet=True, lowercaseElementName=True,
                 lowercaseAttrName=True, **kwargs):
        HTMLTokenizer.__init__(self, stream, encoding, parseMeta,
                               useChardet, lowercaseElementName,
                               lowercaseAttrName, **kwargs)

    def __iter__(self):
        for token in HTMLTokenizer.__iter__(self):
            token = self.sanitize_token(token)
            if token:
                yield token
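The dict-with-callable form of allowed_attributes that sanitize_token()
documents above looks like this in use (this mirrors test_invalid_filter_attr
in bleach/tests/test_security.py)::

    >>> from bleach import clean

    >>> attrs = {'img': lambda name, val: name == 'src'
    ...                                   and val == 'http://example.com/'}
    >>> clean('<img onclick="evil" src="http://example.com/">',
    ...       tags=['img'], attributes=attrs)
    u'<img src="http://example.com/">'

    >>> clean('<img onclick="evil" src="http://badhost.com/">',
    ...       tags=['img'], attributes=attrs)
    u'<img>'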