breadability-0.1.20/0000775000175000017500000000000012322641424015645 5ustar rhardingrharding00000000000000breadability-0.1.20/PKG-INFO0000664000175000017500000002567012322641424016754 0ustar rhardingrharding00000000000000Metadata-Version: 1.1 Name: breadability Version: 0.1.20 Summary: Port of Readability HTML parser in Python Home-page: https://github.com/bookieio/breadability Author: Rick Harding Author-email: rharding@mitechie.com License: BSD Description: breadability - another readability Python (v2.6-v3.3) port =========================================================== .. image:: https://api.travis-ci.org/bookieio/breadability.png?branch=master :target: https://travis-ci.org/bookieio/breadability.py I've tried to work with the various forks of some ancient codebase that ported `readability`_ to Python. The lack of tests, unused regex's, and commented out sections of code in other Python ports just drove me nuts. I put forth an effort to bring in several of the better forks into one code base, but they've diverged so much that I just can't work with it. So what's any sane person to do? Re-port it with my own repo, add some tests, infrastructure, and try to make this port better. OSS FTW (and yea, NIH FML, but oh well I did try) This is a pretty straight port of the JS here: - http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82 - http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ Alternatives ------------ - https://github.com/codelucas/newspaper - https://github.com/grangier/python-goose - https://github.com/aidanf/BTE - http://www.unixuser.org/~euske/python/webstemmer/#extract - https://github.com/al3xandru/readability.py - https://github.com/rcarmo/soup-strainer - https://github.com/bcampbell/decruft - https://github.com/gfxmonk/python-readability - https://github.com/srid/readability - https://github.com/dcramer/decruft - https://github.com/reorx/readability - https://github.com/mote/python-readability - https://github.com/predatell/python-readability-lxml - https://github.com/Harshavardhana/boilerpipy - https://github.com/raptium/hitomi - https://github.com/kingwkb/readability Installation ------------ This does depend on lxml so you'll need some C headers in order to install things from pip so that it can compile. .. code-block:: bash $ [sudo] apt-get install libxml2-dev libxslt-dev $ [sudo] pip install git+git://github.com/bookieio/breadability.git Tests ----- .. code-block:: bash $ nosetests-2.6 tests && nosetests-3.2 tests && nosetests-2.7 tests && nosetests-3.3 tests Usage ----- Command line ~~~~~~~~~~~~ .. code-block:: bash $ breadability http://wiki.python.org/moin/BeginnersGuide Options ``````` - **b** will write out the parsed content to a temp file and open it in a browser for viewing. - **d** will write out debug scoring statements to help track why a node was chosen as the document and why some nodes were removed from the final product. - **f** will override the default behaviour of getting an html fragment (
) and give you back a full document. - **v** will output in verbose debug mode and help let you know why it parsed how it did. Python API ~~~~~~~~~~ .. code-block:: python from __future__ import print_function from breadability.readable import Article if __name__ == "__main__": document = Article(html_as_text, url=source_url) print(document.readable) Work to be done --------------- Yep, I've got some catching up to do. I don't do pagination, I've got a lot of custom tweaks I need to get going, there are some articles that fail to parse. I also have more tests to write on a lot of the cleaning helpers, but hopefully things are setup in a way that those can/will be added. Fortunately, I need this library for my tools: - https://bmark.us - http://r.bmark.us so I really need this to be an active and improving project. Off the top of my heads TODO list: - Support metadata from parsed article [url, confidence scores, all candidates we thought about?] - More tests, more thorough tests - More sample articles we need to test against in the test_articles - Tests that run through and check for regressions of the test_articles - Tidy'ing the HTML that comes out, might help with regression tests ^^ - Multiple page articles - Performance tuning, we do a lot of looping and re-drop some nodes that should be skipped. We should have a set of regression tests for this so that if we implement a change that blows up performance we know it right away. - More docs for things, but sphinx docs and in code comments to help understand wtf we're doing and why. That's the biggest hurdle to some of this stuff. Inspiration ~~~~~~~~~~~ - `python-readability`_ - `decruft`_ - `readability`_ .. _readability: http://code.google.com/p/arc90labs-readability/ .. _TravisCI: http://travis-ci.org/ .. _decruft: https://github.com/dcramer/decruft .. _python-readability: https://github.com/buriy/python-readability .. :changelog: Changelog for breadability ========================== 0.1.20 (April 13th 2014) ------------------------- - Don't include tests in sdist builds. 0.1.19 (April 13th 2014) -------------------------- - Replace charade with chardet for easier packaging. 0.1.18 (April 6th 2014) ------------------------ - Improved decoding of the page into Unicode. 0.1.17 (Jan 22nd 2014) ---------------------- - More log quieting down to INFO vs WARN 0.1.16 (Jan 22nd 2014) ---------------------- - Clean up logging output at warning when it's not a true warning 0.1.15 (Nov 29th 2013) ---------------------- - Merge changes from 0.1.14 of breadability with the fork https://github.com/miso-belica/readability.py and tweaking to return to the name breadability. - Fork: Added property ``Article.main_text`` for getting text annotated with semantic HTML tags (, , ...). - Fork: Join node with 1 child of the same type. From ``
<div><div>...</div></div>`` we get ``<div>...</div>``. - Fork: Don't change <div> to <p> if it contains <p>
elements. - Fork: Renamed test generation helper 'readability_newtest' -> 'readability_test'. - Fork: Renamed package to readability. (Renamed back) - Fork: Added support for Python >= 3.2. - Fork: Py3k compatible package 'charade' is used instead of 'chardet'. 0.1.14 (Nov 7th 2013) --------------------- - Update sibling append to only happen when sibling doesn't already exist. 0.1.13 (Aug 31st 2013) ---------------------- - Give images in content boy a better chance of survival - Add tests 0.1.12 (July 28th 2013) ----------------------- - Add a user agent to requests. 0.1.11 (Dec 12th 2012) ---------------------- - Add argparse to the install requires for python < 2.7 0.1.10 (Sept 13th 2012) ----------------------- - Updated scoring bonus and penalty with , and " characters. 0.1.9 (Aug 27nd 2012) --------------------- - In case of an issue dealing with candidates we need to act like we didn't find any candidates for the article content. #10 0.1.8 (Aug 27nd 2012) --------------------- - Add code/tests for an empty document. - Fixes #9 to handle xml parsing issues. 0.1.7 (July 21nd 2012) ---------------------- - Change the encode 'replace' kwarg into a normal arg for older python version. 0.1.6 (June 17th 2012) ---------------------- - Fix the link removal, add tests and a place to process other bad links. 0.1.5 (June 16th 2012) ---------------------- - Start to look at removing bad links from content in the conditional cleaning state. This was really used for the scripting.com site's garbage. 0.1.4 (June 16th 2012) ---------------------- - Add a test generation helper readability_newtest script. - Add tests and fixes for the scripting news parse failure. 0.1.3 (June 15th 2012) ---------------------- - Add actual testing of full articles for regression tests. - Update parser to properly clean after winner doc node is chosen. 0.1.2 (May 28th 2012) --------------------- - Bugfix: #4 issue with logic of the 100char bonus points in scoring - Garden with PyLint/PEP8 - Add a bunch of tests to readable/scoring code. 
0.1.1 (May 11th 2012) --------------------- - Fix bugs in scoring to help in getting right content - Add concept of -d which shows scoring/decisions on nodes - Update command line client to be able to pipe output to other tools 0.1.0 (May 6th 2012) -------------------- - Initial release and upload to PyPi Keywords: bookie,breadability,content,HTML,parsing,readability,readable Platform: UNKNOWN Classifier: Development Status :: 5 - Production/Stable Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: BSD License Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 2 Classifier: Programming Language :: Python :: 2.6 Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.2 Classifier: Programming Language :: Python :: 3.3 Classifier: Programming Language :: Python :: Implementation :: CPython Classifier: Topic :: Internet :: WWW/HTTP Classifier: Topic :: Software Development :: Pre-processors Classifier: Topic :: Text Processing :: Filters Classifier: Topic :: Text Processing :: Markup :: HTML breadability-0.1.20/setup.cfg0000664000175000017500000000023612322641424017467 0ustar rhardingrharding00000000000000[nosetests] with-coverage = 1 cover-package = breadability cover-erase = 1 [wheel] universal = 1 [egg_info] tag_build = tag_date = 0 tag_svn_revision = 0 breadability-0.1.20/setup.py0000664000175000017500000000504712322641326017366 0ustar rhardingrharding00000000000000import sys from os.path import ( abspath, dirname, join, ) from setuptools import setup VERSION = "0.1.20" VERSION_SUFFIX = "%d.%d" % sys.version_info[:2] CURRENT_DIRECTORY = abspath(dirname(__file__)) with open(join(CURRENT_DIRECTORY, "README.rst")) as readme: with open(join(CURRENT_DIRECTORY, "CHANGELOG.rst")) as changelog: long_description = "%s\n\n%s" % (readme.read(), changelog.read()) install_requires = [ "docopt>=0.6.1,<0.7", "chardet", "lxml>=2.0", ] tests_require = [ "nose-selecttests", "coverage", "pylint", "nose", "pep8", ] if sys.version_info < (2, 7): install_requires.append("unittest2") console_script_targets = [ "breadability = breadability.scripts.client:main", "breadability-{0} = breadability.scripts.client:main", "breadability_test = breadability.scripts.test_helper:main", "breadability_test-{0} = breadability.scripts.test_helper:main", ] console_script_targets = [ target.format(VERSION_SUFFIX) for target in console_script_targets ] setup( name="breadability", version=VERSION, description="Port of Readability HTML parser in Python", long_description=long_description, keywords=[ "bookie", "breadability", "content", "HTML", "parsing", "readability", "readable", ], author="Rick Harding", author_email="rharding@mitechie.com", url="https://github.com/bookieio/breadability", license="BSD", classifiers=[ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.2", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: Implementation :: CPython", "Topic :: Internet :: WWW/HTTP", "Topic :: Software Development :: Pre-processors", "Topic :: Text Processing :: 
Filters", "Topic :: Text Processing :: Markup :: HTML", ], packages=['breadability'], include_package_data=True, zip_safe=False, install_requires=install_requires, tests_require=tests_require, test_suite="nose.collector", entry_points={ "console_scripts": console_script_targets, } ) breadability-0.1.20/CHANGELOG.rst0000664000175000017500000000653712322641374017705 0ustar rhardingrharding00000000000000.. :changelog: Changelog for breadability ========================== 0.1.20 (April 13th 2014) ------------------------- - Don't include tests in sdist builds. 0.1.19 (April 13th 2014) -------------------------- - Replace charade with chardet for easier packaging. 0.1.18 (April 6th 2014) ------------------------ - Improved decoding of the page into Unicode. 0.1.17 (Jan 22nd 2014) ---------------------- - More log quieting down to INFO vs WARN 0.1.16 (Jan 22nd 2014) ---------------------- - Clean up logging output at warning when it's not a true warning 0.1.15 (Nov 29th 2013) ---------------------- - Merge changes from 0.1.14 of breadability with the fork https://github.com/miso-belica/readability.py and tweaking to return to the name breadability. - Fork: Added property ``Article.main_text`` for getting text annotated with semantic HTML tags (, , ...). - Fork: Join node with 1 child of the same type. From ``

<div><div>...</div></div>`` we get ``<div>...</div>``. - Fork: Don't change <div> to <p> if it contains <p>
elements. - Fork: Renamed test generation helper 'readability_newtest' -> 'readability_test'. - Fork: Renamed package to readability. (Renamed back) - Fork: Added support for Python >= 3.2. - Fork: Py3k compatible package 'charade' is used instead of 'chardet'. 0.1.14 (Nov 7th 2013) --------------------- - Update sibling append to only happen when sibling doesn't already exist. 0.1.13 (Aug 31st 2013) ---------------------- - Give images in content boy a better chance of survival - Add tests 0.1.12 (July 28th 2013) ----------------------- - Add a user agent to requests. 0.1.11 (Dec 12th 2012) ---------------------- - Add argparse to the install requires for python < 2.7 0.1.10 (Sept 13th 2012) ----------------------- - Updated scoring bonus and penalty with , and " characters. 0.1.9 (Aug 27nd 2012) --------------------- - In case of an issue dealing with candidates we need to act like we didn't find any candidates for the article content. #10 0.1.8 (Aug 27nd 2012) --------------------- - Add code/tests for an empty document. - Fixes #9 to handle xml parsing issues. 0.1.7 (July 21nd 2012) ---------------------- - Change the encode 'replace' kwarg into a normal arg for older python version. 0.1.6 (June 17th 2012) ---------------------- - Fix the link removal, add tests and a place to process other bad links. 0.1.5 (June 16th 2012) ---------------------- - Start to look at removing bad links from content in the conditional cleaning state. This was really used for the scripting.com site's garbage. 0.1.4 (June 16th 2012) ---------------------- - Add a test generation helper readability_newtest script. - Add tests and fixes for the scripting news parse failure. 0.1.3 (June 15th 2012) ---------------------- - Add actual testing of full articles for regression tests. - Update parser to properly clean after winner doc node is chosen. 0.1.2 (May 28th 2012) --------------------- - Bugfix: #4 issue with logic of the 100char bonus points in scoring - Garden with PyLint/PEP8 - Add a bunch of tests to readable/scoring code. 0.1.1 (May 11th 2012) --------------------- - Fix bugs in scoring to help in getting right content - Add concept of -d which shows scoring/decisions on nodes - Update command line client to be able to pipe output to other tools 0.1.0 (May 6th 2012) -------------------- - Initial release and upload to PyPi breadability-0.1.20/breadability/0000775000175000017500000000000012322641424020300 5ustar rhardingrharding00000000000000breadability-0.1.20/breadability/scoring.py0000664000175000017500000002055412271036170022323 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- """Handle dealing with scoring nodes and content for our parsing.""" from __future__ import absolute_import from __future__ import division, print_function import re import logging from hashlib import md5 from lxml.etree import tostring from ._compat import to_bytes from .utils import normalize_whitespace # A series of sets of attributes we check to help in determining if a node is # a potential candidate or not. 
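# For example, a node with class="sidebar comment" matches CLS_UNLIKELY and
# nothing in CLS_MAYBE, so is_unlikely_node() flags it for removal, while a
# node with class="article main" matches CLS_WEIGHT_POSITIVE and picks up a
# scoring bonus in get_class_weight().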
CLS_UNLIKELY = re.compile( "combx|comment|community|disqus|extra|foot|header|menu|remark|rss|" "shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|perma|popup|" "tweet|twitter|social|breadcrumb", re.IGNORECASE ) CLS_MAYBE = re.compile( "and|article|body|column|main|shadow|entry", re.IGNORECASE ) CLS_WEIGHT_POSITIVE = re.compile( "article|body|content|entry|main|page|pagination|post|text|blog|story", re.IGNORECASE ) CLS_WEIGHT_NEGATIVE = re.compile( "combx|comment|com-|contact|foot|footer|footnote|head|masthead|media|meta|" "outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|" "tool|widget", re.IGNORECASE ) logger = logging.getLogger("breadability") def check_node_attributes(pattern, node, *attributes): """ Searches match in attributes against given pattern and if finds the match against any of them returns True. """ for attribute_name in attributes: attribute = node.get(attribute_name) if attribute is not None and pattern.search(attribute): return True return False def generate_hash_id(node): """ Generates a hash_id for the node in question. :param node: lxml etree node """ try: content = tostring(node) except Exception: logger.exception("Generating of hash failed") content = to_bytes(repr(node)) hash_id = md5(content).hexdigest() return hash_id[:8] def get_link_density(node, node_text=None): """ Computes the ratio for text in given node and text in links contained in the node. It is computed from number of characters in the texts. :parameter Element node: HTML element in which links density is computed. :parameter string node_text: Text content of given node if it was obtained before. :returns float: Returns value of computed 0 <= density <= 1, where 0 means no links and 1 means that node contains only links. """ if node_text is None: node_text = node.text_content() node_text = normalize_whitespace(node_text.strip()) text_length = len(node_text) if text_length == 0: return 0.0 links_length = sum(map(_get_normalized_text_length, node.findall(".//a"))) # Give 50 bonus chars worth of length for each img. # Tweaking this 50 down a notch should help if we hit false positives. img_bonuses = 50 * len(node.findall(".//img")) links_length = max(0, links_length - img_bonuses) return links_length / text_length def _get_normalized_text_length(node): return len(normalize_whitespace(node.text_content().strip())) def get_class_weight(node): """ Computes weight of element according to its class/id. We're using sets to help efficiently check for existence of matches. """ weight = 0 if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "class"): weight -= 25 if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "class"): weight += 25 if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "id"): weight -= 25 if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "id"): weight += 25 return weight def is_unlikely_node(node): """ Short helper for checking unlikely status. If the class or id are in the unlikely list, and there's not also a class/id in the likely list then it might need to be removed. 
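For example, class="comment" with no article-like class or id is reported as unlikely, while class="article comment" is not.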
""" unlikely = check_node_attributes(CLS_UNLIKELY, node, "class", "id") maybe = check_node_attributes(CLS_MAYBE, node, "class", "id") return bool(unlikely and not maybe and node.tag != "body") def score_candidates(nodes): """Given a list of potential nodes, find some initial scores to start""" MIN_HIT_LENTH = 25 candidates = {} for node in nodes: logger.debug("* Scoring candidate %s %r", node.tag, node.attrib) # if the node has no parent it knows of then it ends up creating a # body & html tag to parent the html fragment parent = node.getparent() if parent is None: logger.debug("Skipping candidate - parent node is 'None'.") continue grand = parent.getparent() if grand is None: logger.debug("Skipping candidate - grand parent node is 'None'.") continue # if paragraph is < `MIN_HIT_LENTH` characters don't even count it inner_text = node.text_content().strip() if len(inner_text) < MIN_HIT_LENTH: logger.debug( "Skipping candidate - inner text < %d characters.", MIN_HIT_LENTH) continue # initialize readability data for the parent # add parent node if it isn't in the candidate list if parent not in candidates: candidates[parent] = ScoredNode(parent) if grand not in candidates: candidates[grand] = ScoredNode(grand) # add a point for the paragraph itself as a base content_score = 1 if inner_text: # add 0.25 points for any commas within this paragraph commas_count = inner_text.count(",") content_score += commas_count * 0.25 logger.debug("Bonus points for %d commas.", commas_count) # subtract 0.5 points for each double quote within this paragraph double_quotes_count = inner_text.count('"') content_score += double_quotes_count * -0.5 logger.debug( "Penalty points for %d double-quotes.", double_quotes_count) # for every 100 characters in this paragraph, add another point # up to 3 points length_points = len(inner_text) / 100 content_score += min(length_points, 3.0) logger.debug("Bonus points for length of text: %f", length_points) # add the score to the parent logger.debug( "Bonus points for parent %s %r with score %f: %f", parent.tag, parent.attrib, candidates[parent].content_score, content_score) candidates[parent].content_score += content_score # the grand node gets half logger.debug( "Bonus points for grand %s %r with score %f: %f", grand.tag, grand.attrib, candidates[grand].content_score, content_score / 2.0) candidates[grand].content_score += content_score / 2.0 if node not in candidates: candidates[node] = ScoredNode(node) candidates[node].content_score += content_score for candidate in candidates.values(): adjustment = 1.0 - get_link_density(candidate.node) candidate.content_score *= adjustment logger.debug( "Link density adjustment for %s %r: %f", candidate.node.tag, candidate.node.attrib, adjustment) return candidates class ScoredNode(object): """ We need Scored nodes we use to track possible article matches We might have a bunch of these so we use __slots__ to keep memory usage down. 
""" __slots__ = ('node', 'content_score') def __init__(self, node): """Given node, set an initial score and weigh based on css and id""" self.node = node self.content_score = 0 if node.tag in ('div', 'article'): self.content_score = 5 if node.tag in ('pre', 'td', 'blockquote'): self.content_score = 3 if node.tag in ('address', 'ol', 'ul', 'dl', 'dd', 'dt', 'li', 'form'): self.content_score = -3 if node.tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'th'): self.content_score = -5 self.content_score += get_class_weight(node) @property def hash_id(self): return generate_hash_id(self.node) def __repr__(self): if self.node is None: return "" % self.content_score return "".format( self.node.tag, self.node.attrib, self.content_score ) breadability-0.1.20/breadability/__init__.py0000664000175000017500000000033212271036170022406 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import ( absolute_import, division, print_function, unicode_literals ) import pkg_resources __version__ = pkg_resources.get_distribution("breadability").version breadability-0.1.20/breadability/document.py0000664000175000017500000001010312322633151022461 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- """Generate a clean nice starting html document to process for an article.""" from __future__ import absolute_import import re import logging import chardet from lxml.etree import ( tounicode, XMLSyntaxError, ) from lxml.html import ( document_fromstring, HTMLParser, ) from ._compat import ( to_bytes, to_unicode, unicode, unicode_compatible, ) from .utils import ( cached_property, ignored, ) logger = logging.getLogger("breadability") TAG_MARK_PATTERN = re.compile(to_bytes(r"]*>\s*")) UTF8_PARSER = HTMLParser(encoding="utf8") CHARSET_META_TAG_PATTERN = re.compile( br"""]+charset=["']?([^'"/>\s]+)""", re.IGNORECASE ) def decode_html(html): """ Converts bytes stream containing an HTML page into Unicode. Tries to guess character encoding from meta tag of by "chardet" library. """ if isinstance(html, unicode): return html match = CHARSET_META_TAG_PATTERN.search(html) if match: declared_encoding = match.group(1).decode("ASCII") # proceed unknown encoding as if it wasn't found at all with ignored(LookupError): return html.decode(declared_encoding, "ignore") # try to enforce UTF-8 firstly with ignored(UnicodeDecodeError): return html.decode("utf8") text = TAG_MARK_PATTERN.sub(to_bytes(" "), html) diff = text.decode("utf8", "ignore").encode("utf8") sizes = len(diff), len(text) # 99% of text is UTF-8 if abs(len(text) - len(diff)) < max(sizes) * 0.01: return html.decode("utf8", "ignore") # try detect encoding encoding = "utf8" encoding_detector = chardet.detect(text) if encoding_detector["encoding"]: encoding = encoding_detector["encoding"] return html.decode(encoding, "ignore") BREAK_TAGS_PATTERN = re.compile( to_unicode(r"(?:<\s*[bh]r[^>]*>\s*)+"), re.IGNORECASE ) def convert_breaks_to_paragraphs(html): """ Converts


<hr> tag and multiple <br> tags into paragraph. """ logger.debug("Converting multiple <br> & <hr> tags into <p>.") return BREAK_TAGS_PATTERN.sub(_replace_break_tags, html) def _replace_break_tags(match): tags = match.group() if to_unicode("<hr") in tags: return to_unicode("</p><p>") elif tags.count(to_unicode("<br")) > 1: return to_unicode("</p><p>
") else: return tags def build_document(html_content, base_href=None): """Requires that the `html_content` not be None""" assert html_content is not None if isinstance(html_content, unicode): html_content = html_content.encode("utf8", "xmlcharrefreplace") try: document = document_fromstring(html_content, parser=UTF8_PARSER) except XMLSyntaxError: raise ValueError("Failed to parse document contents.") if base_href: document.make_links_absolute(base_href, resolve_base_href=True) else: document.resolve_base_href() return document @unicode_compatible class OriginalDocument(object): """The original document to process.""" def __init__(self, html, url=None): self._html = html self._url = url @property def url(self): """Source URL of HTML document.""" return self._url def __unicode__(self): """Renders the document as a string.""" return tounicode(self.dom) @cached_property def dom(self): """Parsed HTML document from the input.""" html = self._html if not isinstance(html, unicode): html = decode_html(html) html = convert_breaks_to_paragraphs(html) document = build_document(html, self._url) return document @cached_property def links(self): """Links within the document.""" return self.dom.findall(".//a") @cached_property def title(self): """Title attribute of the parsed document.""" title_element = self.dom.find(".//title") if title_element is None or title_element.text is None: return "" else: return title_element.text.strip() breadability-0.1.20/breadability/annotated_text.py0000664000175000017500000000527212271036170023700 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from itertools import groupby from lxml.sax import saxify, ContentHandler from .utils import is_blank, shrink_text from ._compat import to_unicode _SEMANTIC_TAGS = frozenset(( "a", "abbr", "acronym", "b", "big", "blink", "blockquote", "cite", "code", "dd", "del", "dfn", "dir", "dl", "dt", "em", "h", "h1", "h2", "h3", "h4", "h5", "h6", "i", "ins", "kbd", "li", "marquee", "menu", "ol", "pre", "q", "s", "samp", "strike", "strong", "sub", "sup", "tt", "u", "ul", "var", )) class AnnotatedTextHandler(ContentHandler): """A class for converting a HTML DOM into annotated text.""" @classmethod def parse(cls, dom): """Converts DOM into paragraphs.""" handler = cls() saxify(dom, handler) return handler.content def __init__(self): self._content = [] self._paragraph = [] self._dom_path = [] @property def content(self): return self._content def startElementNS(self, name, qname, attrs): namespace, name = name if name in _SEMANTIC_TAGS: self._dom_path.append(to_unicode(name)) def endElementNS(self, name, qname): namespace, name = name if name == "p" and self._paragraph: self._append_paragraph(self._paragraph) elif name in ("ol", "ul", "pre") and self._paragraph: self._append_paragraph(self._paragraph) self._dom_path.pop() elif name in _SEMANTIC_TAGS: self._dom_path.pop() def endDocument(self): if self._paragraph: self._append_paragraph(self._paragraph) def _append_paragraph(self, paragraph): paragraph = self._process_paragraph(paragraph) self._content.append(paragraph) self._paragraph = [] def _process_paragraph(self, paragraph): current_paragraph = [] for annotation, items in groupby(paragraph, key=lambda i: i[1]): if annotation and "li" in annotation: for text, _ in items: text = shrink_text(text) current_paragraph.append((text, annotation)) else: text = "".join(i[0] for i in items) text = shrink_text(text) 
current_paragraph.append((text, annotation)) return tuple(current_paragraph) def characters(self, content): if is_blank(content): return if self._dom_path: pair = (content, tuple(sorted(frozenset(self._dom_path)))) else: pair = (content, None) self._paragraph.append(pair) breadability-0.1.20/breadability/_compat.py0000664000175000017500000000530412271036170022275 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from sys import version_info PY3 = version_info[0] == 3 if PY3: bytes = bytes unicode = str else: bytes = str unicode = unicode string_types = (bytes, unicode,) try: # Assert to hush pyflakes about the unused import. This is a _compat # module and we expect this to aid in other code importing urllib. import urllib2 as urllib assert urllib except ImportError: import urllib.request as urllib assert urllib def unicode_compatible(cls): """ Decorator for unicode compatible classes. Method ``__unicode__`` has to be implemented to work decorator as expected. """ if PY3: cls.__str__ = cls.__unicode__ cls.__bytes__ = lambda self: self.__str__().encode("utf8") else: cls.__str__ = lambda self: self.__unicode__().encode("utf8") return cls def to_string(object): return to_unicode(object) if PY3 else to_bytes(object) def to_bytes(object): try: if isinstance(object, bytes): return object elif isinstance(object, unicode): return object.encode("utf8") else: # try encode instance to bytes return instance_to_bytes(object) except UnicodeError: # recover from codec error and use 'repr' function return to_bytes(repr(object)) def to_unicode(object): try: if isinstance(object, unicode): return object elif isinstance(object, bytes): return object.decode("utf8") else: # try decode instance to unicode return instance_to_unicode(object) except UnicodeError: # recover from codec error and use 'repr' function return to_unicode(repr(object)) def instance_to_bytes(instance): if PY3: if hasattr(instance, "__bytes__"): return bytes(instance) elif hasattr(instance, "__str__"): return unicode(instance).encode("utf8") else: if hasattr(instance, "__str__"): return bytes(instance) elif hasattr(instance, "__unicode__"): return unicode(instance).encode("utf8") return to_bytes(repr(instance)) def instance_to_unicode(instance): if PY3: if hasattr(instance, "__str__"): return unicode(instance) elif hasattr(instance, "__bytes__"): return bytes(instance).decode("utf8") else: if hasattr(instance, "__unicode__"): return unicode(instance) elif hasattr(instance, "__str__"): return bytes(instance).decode("utf8") return to_unicode(repr(instance)) breadability-0.1.20/breadability/scripts/0000775000175000017500000000000012322641424021767 5ustar rhardingrharding00000000000000breadability-0.1.20/breadability/scripts/__init__.py0000664000175000017500000000000012271036170024065 0ustar rhardingrharding00000000000000breadability-0.1.20/breadability/scripts/client.py0000664000175000017500000000441412271036170023621 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- """ A fast python port of arc90's readability tool Usage: breadability [options] breadability --version breadability --help Arguments: URL or file path to process in readable form. Options: -f, --fragment Output html fragment by default. -b, --browser Open the parsed content in your web browser. -d, --debug Output the detailed scoring information for debugging parsing. -v, --verbose Increase logging verbosity to DEBUG. 
--version Display program's version number and exit. -h, --help Display this help message and exit. """ from __future__ import absolute_import from __future__ import division, print_function, unicode_literals import logging import locale import webbrowser from tempfile import NamedTemporaryFile from docopt import docopt from .. import __version__ from .._compat import urllib from ..readable import Article HEADERS = { "User-Agent": 'breadability/{version} ({url})'.format( url="https://github.com/bookieio/breadability", version=__version__ ) } def parse_args(): return docopt(__doc__, version=__version__) def main(): args = parse_args() logger = logging.getLogger("breadability") if args["--verbose"]: logger.setLevel(logging.DEBUG) resource = args[""] if resource.startswith("www"): resource = "http://" + resource url = None if resource.startswith("http://") or resource.startswith("https://"): url = resource request = urllib.Request(url, headers=HEADERS) response = urllib.urlopen(request) content = response.read() response.close() else: with open(resource, "r") as file: content = file.read() document = Article(content, url=url, return_fragment=args["--fragment"]) if args["--browser"]: html_file = NamedTemporaryFile(mode="wb", suffix=".html", delete=False) content = document.readable.encode("utf8") html_file.write(content) html_file.close() webbrowser.open(html_file.name) else: encoding = locale.getpreferredencoding() content = document.readable.encode(encoding) print(content) if __name__ == '__main__': main() breadability-0.1.20/breadability/scripts/test_helper.py0000664000175000017500000000645512320272411024663 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- """ Helper to generate a new set of article test files for breadability. Usage: breadability_test --name breadability_test --version breadability_test --help Arguments: The url of content to fetch for the article.html Options: -n , --name= Name of the test directory. --version Show program's version number and exit. -h, --help Show this help message and exit. """ from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from os import mkdir from os.path import join, dirname, pardir, exists as path_exists from docopt import docopt from .. 
import __version__ from .._compat import to_unicode, urllib TEST_PATH = join( dirname(__file__), pardir, pardir, "tests/test_articles" ) TEST_TEMPLATE = '''# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from os.path import join, dirname from breadability.readable import Article from ...compat import unittest class TestArticle(unittest.TestCase): """ Test the scoring and parsing of the article from URL below: %(source_url)s """ def setUp(self): """Load up the article for us""" article_path = join(dirname(__file__), "article.html") with open(article_path, "rb") as file: self.document = Article(file.read(), "%(source_url)s") def tearDown(self): """Drop the article""" self.document = None def test_parses(self): """Verify we can parse the document.""" self.assertIn('id="readabilityBody"', self.document.readable) def test_content_exists(self): """Verify that some content exists.""" self.assertIn("#&@#&@#&@", self.document.readable) def test_content_does_not_exist(self): """Verify we cleaned out some content that shouldn't exist.""" self.assertNotIn("", self.document.readable) ''' def parse_args(): return docopt(__doc__, version=__version__) def make_test_directory(name): """Generates a new directory for tests.""" directory_name = "test_" + name.replace(" ", "_") directory_path = join(TEST_PATH, directory_name) if not path_exists(directory_path): mkdir(directory_path) return directory_path def make_test_files(directory_path, url): init_file = join(directory_path, "__init__.py") open(init_file, "a").close() data = TEST_TEMPLATE % { "source_url": to_unicode(url) } test_file = join(directory_path, "test_article.py") with open(test_file, "w") as file: file.write(data) def fetch_article(directory_path, url): """Get the content of the url and make it the article.html""" opener = urllib.build_opener() opener.addheaders = [("Accept-Charset", "utf-8")] response = opener.open(url) html_data = response.read() response.close() path = join(directory_path, "article.html") with open(path, "wb") as file: file.write(html_data) def main(): """Run the script.""" args = parse_args() directory = make_test_directory(args["--name"]) make_test_files(directory, args[""]) fetch_article(directory, args[""]) if __name__ == "__main__": main() breadability-0.1.20/breadability/utils.py0000664000175000017500000000325712320273343022020 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals import re try: from contextlib import ignored except ImportError: from contextlib import contextmanager @contextmanager def ignored(*exceptions): try: yield except tuple(exceptions): pass MULTIPLE_WHITESPACE_PATTERN = re.compile(r"\s+", re.UNICODE) def is_blank(text): """ Returns ``True`` if string contains only whitespace characters or is empty. Otherwise ``False`` is returned. """ return not text or text.isspace() def shrink_text(text): return normalize_whitespace(text.strip()) def normalize_whitespace(text): """ Translates multiple whitespace into single space character. If there is at least one new line character chunk is replaced by single LF (Unix new line) character. """ return MULTIPLE_WHITESPACE_PATTERN.sub(_replace_whitespace, text) def _replace_whitespace(match): text = match.group() if "\n" in text or "\r" in text: return "\n" else: return " " def cached_property(getter): """ Decorator that converts a method into memoized property. 
The decorator works as expected only for classes with attribute '__dict__' and immutable properties. """ def decorator(self): key = "_cached_property_" + getter.__name__ if not hasattr(self, key): setattr(self, key, getter(self)) return getattr(self, key) decorator.__name__ = getter.__name__ decorator.__module__ = getter.__module__ decorator.__doc__ = getter.__doc__ return property(decorator) breadability-0.1.20/breadability/readable.py0000664000175000017500000003620712320272411022413 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import import logging from copy import deepcopy from operator import attrgetter from pprint import PrettyPrinter from lxml.html.clean import Cleaner from lxml.etree import tounicode, tostring from lxml.html import fragment_fromstring, fromstring from .document import OriginalDocument from .annotated_text import AnnotatedTextHandler from .scoring import ( get_class_weight, get_link_density, is_unlikely_node, score_candidates, ) from .utils import cached_property, shrink_text html_cleaner = Cleaner( scripts=True, javascript=True, comments=True, style=True, links=True, meta=False, add_nofollow=False, page_structure=False, processing_instructions=True, embedded=False, frames=False, forms=False, annoying_tags=False, remove_tags=None, kill_tags=("noscript", "iframe"), remove_unknown_tags=False, safe_attrs_only=False) SCORABLE_TAGS = ("div", "p", "td", "pre", "article") ANNOTATION_TAGS = ( "a", "abbr", "acronym", "b", "big", "blink", "blockquote", "br", "cite", "code", "dd", "del", "dir", "dl", "dt", "em", "font", "h", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "i", "ins", "kbd", "li", "marquee", "menu", "ol", "p", "pre", "q", "s", "samp", "span", "strike", "strong", "sub", "sup", "tt", "u", "ul", "var", ) NULL_DOCUMENT = """ """ logger = logging.getLogger("breadability") def ok_embedded_video(node): """Check if this embed/video is an ok one to count.""" good_keywords = ('youtube', 'blip.tv', 'vimeo') node_str = tounicode(node) for key in good_keywords: if key in node_str: return True return False def build_base_document(dom, return_fragment=True): """ Builds a base document with the body as root. :param dom: Parsed lxml tree (Document Object Model). :param bool return_fragment: If True only

<div> fragment is returned. Otherwise full HTML document is returned. """ body_element = dom.find(".//body") if body_element is None: fragment = fragment_fromstring('<div id="readabilityBody"/>') fragment.append(dom) else: body_element.tag = "div" body_element.set("id", "readabilityBody") fragment = body_element return document_from_fragment(fragment, return_fragment) def build_error_document(dom, return_fragment=True): """ Builds an empty error document with the body as root. :param bool return_fragment: If True only <div> fragment is returned. Otherwise full HTML document is returned. """ fragment = fragment_fromstring( '<div id="readabilityBody" class="parsing-error"/>
') return document_from_fragment(fragment, return_fragment) def document_from_fragment(fragment, return_fragment): if return_fragment: document = fragment else: document = fromstring(NULL_DOCUMENT) body_element = document.find(".//body") body_element.append(fragment) document.doctype = "" return document def check_siblings(candidate_node, candidate_list): """ Looks through siblings for content that might also be related. Things like preambles, content split by ads that we removed, etc. """ candidate_css = candidate_node.node.get("class") potential_target = candidate_node.content_score * 0.2 sibling_target_score = potential_target if potential_target > 10 else 10 parent = candidate_node.node.getparent() siblings = parent.getchildren() if parent is not None else [] for sibling in siblings: append = False content_bonus = 0 if sibling is candidate_node.node: append = True # Give a bonus if sibling nodes and top candidates have the example # same class name if candidate_css and sibling.get("class") == candidate_css: content_bonus += candidate_node.content_score * 0.2 if sibling in candidate_list: adjusted_score = \ candidate_list[sibling].content_score + content_bonus if adjusted_score >= sibling_target_score: append = True if sibling.tag == "p": link_density = get_link_density(sibling) content = sibling.text_content() content_length = len(content) if content_length > 80 and link_density < 0.25: append = True elif content_length < 80 and link_density == 0: if ". " in content: append = True if append: logger.debug( "Sibling appended: %s %r", sibling.tag, sibling.attrib) if sibling.tag not in ("div", "p"): # We have a node that isn't a common block level element, like # a form or td tag. Turn it into a div so it doesn't get # filtered out later by accident. 
sibling.tag = "div" if candidate_node.node != sibling: candidate_node.node.append(sibling) return candidate_node def clean_document(node): """Cleans up the final document we return as the readable article.""" if node is None or len(node) == 0: return None logger.debug("\n\n-------------- CLEANING DOCUMENT -----------------") to_drop = [] for n in node.iter(): # clean out any in-line style properties if "style" in n.attrib: n.set("style", "") # remove embended objects unless it's wanted video if n.tag in ("object", "embed") and not ok_embedded_video(n): logger.debug("Dropping node %s %r", n.tag, n.attrib) to_drop.append(n) # clean headings with bad css or high link density if n.tag in ("h1", "h2", "h3", "h4") and get_class_weight(n) < 0: logger.debug("Dropping <%s>, it's insignificant", n.tag) to_drop.append(n) if n.tag in ("h3", "h4") and get_link_density(n) > 0.33: logger.debug("Dropping <%s>, it's insignificant", n.tag) to_drop.append(n) # drop block element without content and children if n.tag in ("div", "p"): text_content = shrink_text(n.text_content()) if len(text_content) < 5 and not n.getchildren(): logger.debug( "Dropping %s %r without content.", n.tag, n.attrib) to_drop.append(n) # finally try out the conditional cleaning of the target node if clean_conditionally(n): to_drop.append(n) drop_nodes_with_parents(to_drop) return node def drop_nodes_with_parents(nodes): for node in nodes: if node.getparent() is None: continue node.drop_tree() logger.debug( "Dropped node with parent %s %r %s", node.tag, node.attrib, node.text_content()[:50] ) def clean_conditionally(node): """Remove the clean_el if it looks like bad content based on rules.""" if node.tag not in ('form', 'table', 'ul', 'div', 'p'): return # this is not the tag we are looking for weight = get_class_weight(node) # content_score = LOOK up the content score for this node we found # before else default to 0 content_score = 0 if weight + content_score < 0: logger.debug('Dropping conditional node') logger.debug('Weight + score < 0') return True commas_count = node.text_content().count(',') if commas_count < 10: logger.debug( "There are %d commas so we're processing more.", commas_count) # If there are not very many commas, and the number of # non-paragraph elements is more than paragraphs or other ominous # signs, remove the element. 
p = len(node.findall('.//p')) img = len(node.findall('.//img')) li = len(node.findall('.//li')) - 100 inputs = len(node.findall('.//input')) embed = 0 embeds = node.findall('.//embed') for e in embeds: if ok_embedded_video(e): embed += 1 link_density = get_link_density(node) content_length = len(node.text_content()) remove_node = False if li > p and node.tag != 'ul' and node.tag != 'ol': logger.debug('Conditional drop: li > p and not ul/ol') remove_node = True elif inputs > p / 3.0: logger.debug('Conditional drop: inputs > p/3.0') remove_node = True elif content_length < 25 and (img == 0 or img > 2): logger.debug('Conditional drop: len < 25 and 0/>2 images') remove_node = True elif weight < 25 and link_density > 0.2: logger.debug('Conditional drop: weight small (%f) and link is dense (%f)', weight, link_density) remove_node = True elif weight >= 25 and link_density > 0.5: logger.debug('Conditional drop: weight big but link heavy') remove_node = True elif (embed == 1 and content_length < 75) or embed > 1: logger.debug( 'Conditional drop: embed w/o much content or many embed') remove_node = True if remove_node: logger.debug('Node will be removed: %s %r %s', node.tag, node.attrib, node.text_content()[:30]) return remove_node return False # nope, don't remove anything def prep_article(doc): """Once we've found our target article we want to clean it up. Clean out: - inline styles - forms - strip empty
<p>
- extra tags """ return clean_document(doc) def find_candidates(document): """ Finds cadidate nodes for the readable version of the article. Here's we're going to remove unlikely nodes, find scores on the rest, clean up and return the final best match. """ nodes_to_score = set() should_remove = set() for node in document.iter(): if is_unlikely_node(node): logger.debug( "We should drop unlikely: %s %r", node.tag, node.attrib) should_remove.add(node) elif is_bad_link(node): logger.debug( "We should drop bad link: %s %r", node.tag, node.attrib) should_remove.add(node) elif node.tag in SCORABLE_TAGS: nodes_to_score.add(node) return score_candidates(nodes_to_score), should_remove def is_bad_link(node): """ Helper to determine if the node is link that is useless. We've hit articles with many multiple links that should be cleaned out because they're just there to pollute the space. See tests for examples. """ if node.tag != "a": return False name = node.get("name") href = node.get("href") if name and not href: return True if href: href_parts = href.split("#") if len(href_parts) == 2 and len(href_parts[1]) > 25: return True return False class Article(object): """Parsed readable object""" def __init__(self, html, url=None, return_fragment=True): """ Create the Article we're going to use. :param html: The string of HTML we're going to parse. :param url: The url so we can adjust the links to still work. :param return_fragment: Should we return a
<div>
fragment or a full document. """ self._original_document = OriginalDocument(html, url=url) self._return_fragment = return_fragment def __str__(self): return tostring(self._readable()) def __unicode__(self): return tounicode(self._readable()) @cached_property def dom(self): """Parsed lxml tree (Document Object Model) of the given html.""" try: dom = self._original_document.dom # cleaning doesn't return, just wipes in place html_cleaner(dom) return leaf_div_elements_into_paragraphs(dom) except ValueError: return None @cached_property def candidates(self): """Generates list of candidates from the DOM.""" dom = self.dom if dom is None or len(dom) == 0: return None candidates, unlikely_candidates = find_candidates(dom) drop_nodes_with_parents(unlikely_candidates) return candidates @cached_property def main_text(self): dom = deepcopy(self.readable_dom).get_element_by_id("readabilityBody") return AnnotatedTextHandler.parse(dom) @cached_property def readable(self): return tounicode(self.readable_dom) @cached_property def readable_dom(self): return self._readable() def _readable(self): """The readable parsed article""" if not self.candidates: logger.info("No candidates found in document.") return self._handle_no_candidates() # right now we return the highest scoring candidate content best_candidates = sorted( (c for c in self.candidates.values()), key=attrgetter("content_score"), reverse=True) printer = PrettyPrinter(indent=2) logger.debug(printer.pformat(best_candidates)) # since we have several candidates, check the winner's siblings # for extra content winner = best_candidates[0] updated_winner = check_siblings(winner, self.candidates) updated_winner.node = prep_article(updated_winner.node) if updated_winner.node is not None: dom = build_base_document( updated_winner.node, self._return_fragment) else: logger.info( 'Had candidates but failed to find a cleaned winning DOM.') dom = self._handle_no_candidates() return self._remove_orphans(dom.get_element_by_id("readabilityBody")) def _remove_orphans(self, dom): for node in dom.iterdescendants(): if len(node) == 1 and tuple(node)[0].tag == node.tag: node.drop_tag() return dom def _handle_no_candidates(self): """ If we fail to find a good candidate we need to find something else. """ # since we've not found a good candidate we're should help this if self.dom is not None and len(self.dom): dom = prep_article(self.dom) dom = build_base_document(dom, self._return_fragment) return self._remove_orphans( dom.get_element_by_id("readabilityBody")) else: logger.info("No document to use.") return build_error_document(self._return_fragment) def leaf_div_elements_into_paragraphs(document): """ Turn some block elements that don't have children block level elements into
<p>
elements. Since we can't change the tree as we iterate over it, we must do this before we process our document. """ for element in document.iter(tag="div"): child_tags = tuple(n.tag for n in element.getchildren()) if "div" not in child_tags and "p" not in child_tags: logger.debug( "Changing leaf block element <%s> into
<p>
", element.tag) element.tag = "p" return document breadability-0.1.20/AUTHORS.txt0000664000175000017500000000010312271036170017524 0ustar rhardingrharding00000000000000Rick Harding (original author) nhnifong Craig Maloney Mišo Belica breadability-0.1.20/README.rst0000664000175000017500000001076312271036170017342 0ustar rhardingrharding00000000000000breadability - another readability Python (v2.6-v3.3) port =========================================================== .. image:: https://api.travis-ci.org/bookieio/breadability.png?branch=master :target: https://travis-ci.org/bookieio/breadability.py I've tried to work with the various forks of some ancient codebase that ported `readability`_ to Python. The lack of tests, unused regex's, and commented out sections of code in other Python ports just drove me nuts. I put forth an effort to bring in several of the better forks into one code base, but they've diverged so much that I just can't work with it. So what's any sane person to do? Re-port it with my own repo, add some tests, infrastructure, and try to make this port better. OSS FTW (and yea, NIH FML, but oh well I did try) This is a pretty straight port of the JS here: - http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82 - http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ Alternatives ------------ - https://github.com/codelucas/newspaper - https://github.com/grangier/python-goose - https://github.com/aidanf/BTE - http://www.unixuser.org/~euske/python/webstemmer/#extract - https://github.com/al3xandru/readability.py - https://github.com/rcarmo/soup-strainer - https://github.com/bcampbell/decruft - https://github.com/gfxmonk/python-readability - https://github.com/srid/readability - https://github.com/dcramer/decruft - https://github.com/reorx/readability - https://github.com/mote/python-readability - https://github.com/predatell/python-readability-lxml - https://github.com/Harshavardhana/boilerpipy - https://github.com/raptium/hitomi - https://github.com/kingwkb/readability Installation ------------ This does depend on lxml so you'll need some C headers in order to install things from pip so that it can compile. .. code-block:: bash $ [sudo] apt-get install libxml2-dev libxslt-dev $ [sudo] pip install git+git://github.com/bookieio/breadability.git Tests ----- .. code-block:: bash $ nosetests-2.6 tests && nosetests-3.2 tests && nosetests-2.7 tests && nosetests-3.3 tests Usage ----- Command line ~~~~~~~~~~~~ .. code-block:: bash $ breadability http://wiki.python.org/moin/BeginnersGuide Options ``````` - **b** will write out the parsed content to a temp file and open it in a browser for viewing. - **d** will write out debug scoring statements to help track why a node was chosen as the document and why some nodes were removed from the final product. - **f** will override the default behaviour of getting an html fragment (
<div>
) and give you back a full document. - **v** will output in verbose debug mode and help let you know why it parsed how it did. Python API ~~~~~~~~~~ .. code-block:: python from __future__ import print_function from breadability.readable import Article if __name__ == "__main__": document = Article(html_as_text, url=source_url) print(document.readable) Work to be done --------------- Yep, I've got some catching up to do. I don't do pagination, I've got a lot of custom tweaks I need to get going, there are some articles that fail to parse. I also have more tests to write on a lot of the cleaning helpers, but hopefully things are setup in a way that those can/will be added. Fortunately, I need this library for my tools: - https://bmark.us - http://r.bmark.us so I really need this to be an active and improving project. Off the top of my heads TODO list: - Support metadata from parsed article [url, confidence scores, all candidates we thought about?] - More tests, more thorough tests - More sample articles we need to test against in the test_articles - Tests that run through and check for regressions of the test_articles - Tidy'ing the HTML that comes out, might help with regression tests ^^ - Multiple page articles - Performance tuning, we do a lot of looping and re-drop some nodes that should be skipped. We should have a set of regression tests for this so that if we implement a change that blows up performance we know it right away. - More docs for things, but sphinx docs and in code comments to help understand wtf we're doing and why. That's the biggest hurdle to some of this stuff. Inspiration ~~~~~~~~~~~ - `python-readability`_ - `decruft`_ - `readability`_ .. _readability: http://code.google.com/p/arc90labs-readability/ .. _TravisCI: http://travis-ci.org/ .. _decruft: https://github.com/dcramer/decruft .. _python-readability: https://github.com/buriy/python-readability breadability-0.1.20/LICENSE.rst0000664000175000017500000000243112271036170017460 0ustar rhardingrharding00000000000000Copyright (c) 2013 Rick Harding and contributors All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: - Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. - Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
breadability-0.1.20/MANIFEST.in0000664000175000017500000000012112271036170017374 0ustar rhardingrharding00000000000000include README.rst include CHANGELOG.rst include LICENSE.rst include AUTHORS.txt breadability-0.1.20/breadability.egg-info/0000775000175000017500000000000012322641424021772 5ustar rhardingrharding00000000000000breadability-0.1.20/breadability.egg-info/PKG-INFO0000664000175000017500000002567012322641424023101 0ustar rhardingrharding00000000000000Metadata-Version: 1.1 Name: breadability Version: 0.1.20 Summary: Port of Readability HTML parser in Python Home-page: https://github.com/bookieio/breadability Author: Rick Harding Author-email: rharding@mitechie.com License: BSD Description: breadability - another readability Python (v2.6-v3.3) port =========================================================== .. image:: https://api.travis-ci.org/bookieio/breadability.png?branch=master :target: https://travis-ci.org/bookieio/breadability.py I've tried to work with the various forks of some ancient codebase that ported `readability`_ to Python. The lack of tests, unused regex's, and commented out sections of code in other Python ports just drove me nuts. I put forth an effort to bring in several of the better forks into one code base, but they've diverged so much that I just can't work with it. So what's any sane person to do? Re-port it with my own repo, add some tests, infrastructure, and try to make this port better. OSS FTW (and yea, NIH FML, but oh well I did try) This is a pretty straight port of the JS here: - http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82 - http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/ Alternatives ------------ - https://github.com/codelucas/newspaper - https://github.com/grangier/python-goose - https://github.com/aidanf/BTE - http://www.unixuser.org/~euske/python/webstemmer/#extract - https://github.com/al3xandru/readability.py - https://github.com/rcarmo/soup-strainer - https://github.com/bcampbell/decruft - https://github.com/gfxmonk/python-readability - https://github.com/srid/readability - https://github.com/dcramer/decruft - https://github.com/reorx/readability - https://github.com/mote/python-readability - https://github.com/predatell/python-readability-lxml - https://github.com/Harshavardhana/boilerpipy - https://github.com/raptium/hitomi - https://github.com/kingwkb/readability Installation ------------ This does depend on lxml so you'll need some C headers in order to install things from pip so that it can compile. .. code-block:: bash $ [sudo] apt-get install libxml2-dev libxslt-dev $ [sudo] pip install git+git://github.com/bookieio/breadability.git Tests ----- .. code-block:: bash $ nosetests-2.6 tests && nosetests-3.2 tests && nosetests-2.7 tests && nosetests-3.3 tests Usage ----- Command line ~~~~~~~~~~~~ .. code-block:: bash $ breadability http://wiki.python.org/moin/BeginnersGuide Options ``````` - **b** will write out the parsed content to a temp file and open it in a browser for viewing. - **d** will write out debug scoring statements to help track why a node was chosen as the document and why some nodes were removed from the final product. - **f** will override the default behaviour of getting an html fragment (
) and give you back a full document. - **v** will output in verbose debug mode and help let you know why it parsed how it did. Python API ~~~~~~~~~~ .. code-block:: python from __future__ import print_function from breadability.readable import Article if __name__ == "__main__": document = Article(html_as_text, url=source_url) print(document.readable) Work to be done --------------- Yep, I've got some catching up to do. I don't do pagination, I've got a lot of custom tweaks I need to get going, there are some articles that fail to parse. I also have more tests to write on a lot of the cleaning helpers, but hopefully things are setup in a way that those can/will be added. Fortunately, I need this library for my tools: - https://bmark.us - http://r.bmark.us so I really need this to be an active and improving project. Off the top of my heads TODO list: - Support metadata from parsed article [url, confidence scores, all candidates we thought about?] - More tests, more thorough tests - More sample articles we need to test against in the test_articles - Tests that run through and check for regressions of the test_articles - Tidy'ing the HTML that comes out, might help with regression tests ^^ - Multiple page articles - Performance tuning, we do a lot of looping and re-drop some nodes that should be skipped. We should have a set of regression tests for this so that if we implement a change that blows up performance we know it right away. - More docs for things, but sphinx docs and in code comments to help understand wtf we're doing and why. That's the biggest hurdle to some of this stuff. Inspiration ~~~~~~~~~~~ - `python-readability`_ - `decruft`_ - `readability`_ .. _readability: http://code.google.com/p/arc90labs-readability/ .. _TravisCI: http://travis-ci.org/ .. _decruft: https://github.com/dcramer/decruft .. _python-readability: https://github.com/buriy/python-readability .. :changelog: Changelog for breadability ========================== 0.1.20 (April 13th 2014) ------------------------- - Don't include tests in sdist builds. 0.1.19 (April 13th 2014) -------------------------- - Replace charade with chardet for easier packaging. 0.1.18 (April 6th 2014) ------------------------ - Improved decoding of the page into Unicode. 0.1.17 (Jan 22nd 2014) ---------------------- - More log quieting down to INFO vs WARN 0.1.16 (Jan 22nd 2014) ---------------------- - Clean up logging output at warning when it's not a true warning 0.1.15 (Nov 29th 2013) ---------------------- - Merge changes from 0.1.14 of breadability with the fork https://github.com/miso-belica/readability.py and tweaking to return to the name breadability. - Fork: Added property ``Article.main_text`` for getting text annotated with semantic HTML tags (, , ...). - Fork: Join node with 1 child of the same type. From ``
<div><div>...</div></div>`` we get ``<div>...</div>``. - Fork: Don't change <div> to <p> if it contains <br>
elements. - Fork: Renamed test generation helper 'readability_newtest' -> 'readability_test'. - Fork: Renamed package to readability. (Renamed back) - Fork: Added support for Python >= 3.2. - Fork: Py3k compatible package 'charade' is used instead of 'chardet'. 0.1.14 (Nov 7th 2013) --------------------- - Update sibling append to only happen when sibling doesn't already exist. 0.1.13 (Aug 31st 2013) ---------------------- - Give images in content body a better chance of survival - Add tests 0.1.12 (July 28th 2013) ----------------------- - Add a user agent to requests. 0.1.11 (Dec 12th 2012) ---------------------- - Add argparse to the install requires for Python < 2.7 0.1.10 (Sept 13th 2012) ----------------------- - Updated scoring bonus and penalty with , and " characters. 0.1.9 (Aug 27th 2012) --------------------- - In case of an issue dealing with candidates we need to act like we didn't find any candidates for the article content. #10 0.1.8 (Aug 27th 2012) --------------------- - Add code/tests for an empty document. - Fixes #9 to handle xml parsing issues. 0.1.7 (July 21st 2012) ---------------------- - Change the encode 'replace' kwarg into a normal arg for older Python versions. 0.1.6 (June 17th 2012) ---------------------- - Fix the link removal, add tests and a place to process other bad links. 0.1.5 (June 16th 2012) ---------------------- - Start to look at removing bad links from content in the conditional cleaning state. This was really used for the scripting.com site's garbage. 0.1.4 (June 16th 2012) ---------------------- - Add a test generation helper readability_newtest script. - Add tests and fixes for the scripting news parse failure. 0.1.3 (June 15th 2012) ---------------------- - Add actual testing of full articles for regression tests. - Update parser to properly clean after winner doc node is chosen. 0.1.2 (May 28th 2012) --------------------- - Bugfix: #4 issue with logic of the 100char bonus points in scoring - Garden with PyLint/PEP8 - Add a bunch of tests to readable/scoring code.
0.1.1 (May 11th 2012) --------------------- - Fix bugs in scoring to help in getting right content - Add concept of -d which shows scoring/decisions on nodes - Update command line client to be able to pipe output to other tools 0.1.0 (May 6th 2012) -------------------- - Initial release and upload to PyPi Keywords: bookie,breadability,content,HTML,parsing,readability,readable Platform: UNKNOWN Classifier: Development Status :: 5 - Production/Stable Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: BSD License Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 2 Classifier: Programming Language :: Python :: 2.6 Classifier: Programming Language :: Python :: 2.7 Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.2 Classifier: Programming Language :: Python :: 3.3 Classifier: Programming Language :: Python :: Implementation :: CPython Classifier: Topic :: Internet :: WWW/HTTP Classifier: Topic :: Software Development :: Pre-processors Classifier: Topic :: Text Processing :: Filters Classifier: Topic :: Text Processing :: Markup :: HTML breadability-0.1.20/breadability.egg-info/dependency_links.txt0000664000175000017500000000000112322641424026040 0ustar rhardingrharding00000000000000 breadability-0.1.20/breadability.egg-info/top_level.txt0000664000175000017500000000001512322641424024520 0ustar rhardingrharding00000000000000breadability breadability-0.1.20/breadability.egg-info/SOURCES.txt0000664000175000017500000000265112322641424023662 0ustar rhardingrharding00000000000000AUTHORS.txt CHANGELOG.rst LICENSE.rst MANIFEST.in README.rst setup.cfg setup.py breadability/__init__.py breadability/_compat.py breadability/annotated_text.py breadability/document.py breadability/readable.py breadability/scoring.py breadability/utils.py breadability.egg-info/PKG-INFO breadability.egg-info/SOURCES.txt breadability.egg-info/dependency_links.txt breadability.egg-info/entry_points.txt breadability.egg-info/not-zip-safe breadability.egg-info/requires.txt breadability.egg-info/top_level.txt breadability/scripts/__init__.py breadability/scripts/client.py breadability/scripts/test_helper.py tests/__init__.py tests/compat.py tests/test_annotated_text.py tests/test_orig_document.py tests/test_readable.py tests/test_scoring.py tests/utils.py tests/test_articles/__init__.py tests/test_articles/test_antipope_org/__init__.py tests/test_articles/test_antipope_org/test_article.py tests/test_articles/test_businessinsider-com/__init__.py tests/test_articles/test_businessinsider-com/test_article.py tests/test_articles/test_businessinsider_com/__init__.py tests/test_articles/test_businessinsider_com/test_article.py tests/test_articles/test_cz_zdrojak_tests/__init__.py tests/test_articles/test_cz_zdrojak_tests/test_article.py tests/test_articles/test_scripting_com/__init__.py tests/test_articles/test_scripting_com/test_article.py tests/test_articles/test_sweetshark/__init__.py tests/test_articles/test_sweetshark/test_article.pybreadability-0.1.20/breadability.egg-info/not-zip-safe0000664000175000017500000000000112320272414024215 0ustar rhardingrharding00000000000000 breadability-0.1.20/breadability.egg-info/entry_points.txt0000664000175000017500000000035712322641424025275 0ustar rhardingrharding00000000000000[console_scripts] breadability = breadability.scripts.client:main breadability-2.7 = breadability.scripts.client:main breadability_test = 
breadability.scripts.test_helper:main breadability_test-2.7 = breadability.scripts.test_helper:main breadability-0.1.20/breadability.egg-info/requires.txt0000664000175000017500000000004412322641424024370 0ustar rhardingrharding00000000000000docopt>=0.6.1,<0.7 chardet lxml>=2.0breadability-0.1.20/tests/0000775000175000017500000000000012322641424017007 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/compat.py0000664000175000017500000000032012271036170020636 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals try: import unittest2 as unittest except ImportError: import unittest breadability-0.1.20/tests/test_scoring.py0000664000175000017500000002454112271036170022071 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals import re from operator import attrgetter from lxml.html import document_fromstring from lxml.html import fragment_fromstring from breadability.readable import Article from breadability.scoring import ( check_node_attributes, generate_hash_id, get_class_weight, score_candidates, ScoredNode, ) from breadability.readable import ( get_link_density, is_unlikely_node, ) from .compat import unittest from .utils import load_snippet class TestHashId(unittest.TestCase): def test_generate_hash(self): dom = fragment_fromstring("
<div><p>ľščťžýáí</p></div>
") generate_hash_id(dom) def test_hash_from_id_on_exception(self): generate_hash_id(None) def test_different_hashes(self): dom = fragment_fromstring("
<div>ľščťžýáí</div>
") hash_dom = generate_hash_id(dom) hash_none = generate_hash_id(None) self.assertNotEqual(hash_dom, hash_none) def test_equal_hashes(self): dom1 = fragment_fromstring("
<div>ľščťžýáí</div>
") dom2 = fragment_fromstring("
<div>ľščťžýáí</div>
") hash_dom1 = generate_hash_id(dom1) hash_dom2 = generate_hash_id(dom2) self.assertEqual(hash_dom1, hash_dom2) hash_none1 = generate_hash_id(None) hash_none2 = generate_hash_id(None) self.assertEqual(hash_none1, hash_none2) class TestCheckNodeAttr(unittest.TestCase): """Verify a node has a class/id in the given set. The idea is that we have sets of known good/bad ids and classes and need to verify the given node does/doesn't have those classes/ids. """ def test_has_class(self): """Verify that a node has a class in our set.""" test_pattern = re.compile('test1|test2', re.I) test_node = fragment_fromstring('
<div/>')
        test_node.set('class', 'test2 comment')

        self.assertTrue(
            check_node_attributes(test_pattern, test_node, 'class'))

    def test_has_id(self):
        """Verify that a node has an id in our set."""
        test_pattern = re.compile('test1|test2', re.I)
        test_node = fragment_fromstring('<div/>')
        test_node.set('id', 'test2')

        self.assertTrue(check_node_attributes(test_pattern, test_node, 'id'))

    def test_lacks_class(self):
        """Verify that a node does not have a class in our set."""
        test_pattern = re.compile('test1|test2', re.I)
        test_node = fragment_fromstring('<div/>')
        test_node.set('class', 'test4 comment')

        self.assertFalse(
            check_node_attributes(test_pattern, test_node, 'class'))

    def test_lacks_id(self):
        """Verify that a node does not have an id in our set."""
        test_pattern = re.compile('test1|test2', re.I)
        test_node = fragment_fromstring('<div/>')
        test_node.set('id', 'test4')
        self.assertFalse(check_node_attributes(test_pattern, test_node, 'id'))


class TestLinkDensity(unittest.TestCase):
    """Verify we calc our link density correctly."""

    def test_empty_node(self):
        """An empty node doesn't have much of a link density"""
        doc = Article("<div></div>
") self.assertEqual(get_link_density(doc.readable_dom), 0.0) def test_small_doc_no_links(self): doc = Article(load_snippet('document_min.html')) self.assertEqual(get_link_density(doc.readable_dom), 0.0) def test_several_links(self): """This doc has a 3 links with the majority of content.""" doc = Article(load_snippet('document_absolute_url.html')) self.assertAlmostEqual(get_link_density(doc.readable_dom), 22/37) class TestClassWeight(unittest.TestCase): """Verify we score nodes correctly based on their class/id attributes.""" def test_no_matches_zero(self): """If you don't have the attribute then you get a weight of 0""" node = fragment_fromstring("
") self.assertEqual(get_class_weight(node), 0) def test_id_hits(self): """If the id is in the list then it gets a weight""" test_div = '
Content
' node = fragment_fromstring(test_div) self.assertEqual(get_class_weight(node), 25) test_div = '
Content
' node = fragment_fromstring(test_div) self.assertEqual(get_class_weight(node), -25) def test_class_hits(self): """If the class is in the list then it gets a weight""" test_div = '
Content
' node = fragment_fromstring(test_div) self.assertEqual(get_class_weight(node), 25) test_div = '
Content
' node = fragment_fromstring(test_div) self.assertEqual(get_class_weight(node), -25) def test_scores_collide(self): """We might hit both positive and negative scores. Positive and negative scoring is done independently so it's possible to hit both positive and negative scores and cancel each other out. """ test_div = '
Content
' node = fragment_fromstring(test_div) self.assertEqual(get_class_weight(node), 0) test_div = '
Content
' node = fragment_fromstring(test_div) self.assertEqual(get_class_weight(node), 25) def test_scores_only_once(self): """Scoring is not cumulative within a class hit.""" test_div = '
Content
' node = fragment_fromstring(test_div) self.assertEqual(get_class_weight(node), 25) class TestUnlikelyNode(unittest.TestCase): """is_unlikely_node should help verify our node is good/bad.""" def test_body_is_always_likely(self): """The body tag is always a likely node.""" test_div = '
Content
' node = fragment_fromstring(test_div) self.assertFalse(is_unlikely_node(node)) def test_is_unlikely(self): "Keywords in the class/id will make us believe this is unlikely." test_div = '
Content
' node = fragment_fromstring(test_div) self.assertTrue(is_unlikely_node(node)) test_div = '
Content
' node = fragment_fromstring(test_div) self.assertTrue(is_unlikely_node(node)) def test_not_unlikely(self): """Suck it double negatives.""" test_div = '
Content
' node = fragment_fromstring(test_div) self.assertFalse(is_unlikely_node(node)) test_div = '
Content
' node = fragment_fromstring(test_div) self.assertFalse(is_unlikely_node(node)) def test_maybe_hits(self): """We've got some maybes that will overrule an unlikely node.""" test_div = '
Content
' node = fragment_fromstring(test_div) self.assertFalse(is_unlikely_node(node)) class TestScoredNode(unittest.TestCase): """ScoredNodes constructed have initial content_scores, etc.""" def test_hash_id(self): """ScoredNodes have a hash_id based on their content Since this is based on the html there are chances for collisions, but it helps us follow and identify nodes through the scoring process. Two identical nodes would score the same, so meh all good. """ test_div = '
Content
' node = fragment_fromstring(test_div) snode = ScoredNode(node) self.assertEqual(snode.hash_id, 'ffa4c519') def test_div_content_score(self): """A div starts out with a score of 5 and modifies from there""" test_div = '
Content
' node = fragment_fromstring(test_div) snode = ScoredNode(node) self.assertEqual(snode.content_score, 5) test_div = '
Content
' node = fragment_fromstring(test_div) snode = ScoredNode(node) self.assertEqual(snode.content_score, 30) test_div = '
Content
' node = fragment_fromstring(test_div) snode = ScoredNode(node) self.assertEqual(snode.content_score, -20) def test_headings_score(self): """Heading tags aren't likely candidates, hurt their scores.""" test_div = '

Heading

'
        node = fragment_fromstring(test_div)
        snode = ScoredNode(node)
        self.assertEqual(snode.content_score, -5)

    def test_list_items(self):
        """List items aren't likely candidates, hurt their scores."""
        test_div = '
  • list item
  • ' node = fragment_fromstring(test_div) snode = ScoredNode(node) self.assertEqual(snode.content_score, -3) class TestScoreCandidates(unittest.TestCase): """The grand daddy of tests to make sure our scoring works Now scoring details will change over time, so the most important thing is to make sure candidates come out in the right order, not necessarily how they scored. Make sure to keep this in mind while getting tests going. """ def test_simple_candidate_set(self): """Tests a simple case of two candidate nodes""" html = """

    This is a great amount of info

    And more content Home

    """ dom = document_fromstring(html) div_nodes = dom.findall(".//div") candidates = score_candidates(div_nodes) ordered = sorted( (c for c in candidates.values()), reverse=True, key=attrgetter("content_score")) self.assertEqual(ordered[0].node.tag, "div") self.assertEqual(ordered[0].node.attrib["class"], "content") self.assertEqual(ordered[1].node.tag, "body") self.assertEqual(ordered[2].node.tag, "html") self.assertEqual(ordered[3].node.tag, "div") self.assertEqual(ordered[3].node.attrib["class"], "footer") breadability-0.1.20/tests/__init__.py0000664000175000017500000000000012271036170021105 0ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/0000775000175000017500000000000012322641424021654 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/__init__.py0000664000175000017500000000000012271036170023752 0ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_scripting_com/0000775000175000017500000000000012322641424025553 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_scripting_com/__init__.py0000664000175000017500000000000012271036170027651 0ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_scripting_com/test_article.py0000664000175000017500000000455112320272411030606 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import ( absolute_import, division, print_function, unicode_literals ) import os from operator import attrgetter from breadability.readable import Article from breadability.readable import check_siblings from breadability.readable import prep_article from ...compat import unittest class TestArticle(unittest.TestCase): """Test the scoring and parsing of the Article""" def setUp(self): """Load up the article for us""" article_path = os.path.join(os.path.dirname(__file__), 'article.html') self.article = open(article_path).read() def tearDown(self): """Drop the article""" self.article = None def test_parses(self): """Verify we can parse the document.""" doc = Article(self.article) self.assertTrue('id="readabilityBody"' in doc.readable) def test_content_exists(self): """Verify that some content exists.""" doc = Article(self.article) self.assertTrue('Amazon and Google' in doc.readable) self.assertFalse('Linkblog updated' in doc.readable) self.assertFalse( '#anExampleGoogleDoesntIntendToShareBlogAndItWill' in doc.readable) @unittest.skip("Test fails because of some weird hash.") def test_candidates(self): """Verify we have candidates.""" doc = Article(self.article) # from lxml.etree import tounicode found = False wanted_hash = '04e46055' for node in doc.candidates.values(): if node.hash_id == wanted_hash: found = node self.assertTrue(found) # we have the right node, it must be deleted for some reason if it's # not still there when we need it to be. # Make sure it's not in our to drop list. for node in doc._should_drop: self.assertFalse(node == found.node) by_score = sorted( [c for c in doc.candidates.values()], key=attrgetter('content_score'), reverse=True) self.assertTrue(by_score[0].node == found.node) updated_winner = check_siblings(by_score[0], doc.candidates) updated_winner.node = prep_article(updated_winner.node) # This article hits up against the img > p conditional filtering # because of the many .gif images in the content. We've removed that # rule. 
breadability-0.1.20/tests/test_articles/test_businessinsider_com/0000775000175000017500000000000012322641424026762 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_businessinsider_com/__init__.py0000664000175000017500000000000012271036170031060 0ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_businessinsider_com/test_article.py0000664000175000017500000000255412320272411032016 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from os.path import join, dirname from breadability.readable import Article from ...compat import unittest class TestArticle(unittest.TestCase): """ Test the scoring and parsing of the article from URL below: http://www.businessinsider.com/tech-ceos-favorite-productivity-hacks-2013-8 """ def setUp(self): """Load up the article for us""" article_path = join(dirname(__file__), "article.html") with open(article_path, "rb") as file: self.document = Article(file.read(), "http://www.businessinsider.com/tech-ceos-favorite-productivity-hacks-2013-8") def tearDown(self): """Drop the article""" self.document = None def test_parses(self): """Verify we can parse the document.""" self.assertIn('id="readabilityBody"', self.document.readable) def test_images_preserved(self): """The div with the comments should be removed.""" images = [ 'bharath-kumar-a-co-founder-at-pugmarksme-suggests-working-on-a-sunday-late-night.jpg', 'bryan-guido-hassin-a-university-professor-and-startup-junkie-uses-airplane-days.jpg', ] for image in images: self.assertIn(image, self.document.readable, image) breadability-0.1.20/tests/test_articles/test_cz_zdrojak_tests/0000775000175000017500000000000012322641424026275 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_cz_zdrojak_tests/__init__.py0000664000175000017500000000000012271036170030373 0ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_cz_zdrojak_tests/test_article.py0000664000175000017500000000324712320272411031331 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from os.path import join, dirname from breadability.readable import Article from breadability._compat import unicode from ...compat import unittest class TestArticle(unittest.TestCase): """ Test the scoring and parsing of the article from URL below: http://www.zdrojak.cz/clanky/jeste-k-testovani/ """ def setUp(self): """Load up the article for us""" article_path = join(dirname(__file__), "article.html") with open(article_path, "rb") as file: self.document = Article(file.read(), "http://www.zdrojak.cz/clanky/jeste-k-testovani/") def tearDown(self): """Drop the article""" self.document = None def test_parses(self): """Verify we can parse the document.""" self.assertIn('id="readabilityBody"', self.document.readable) def test_content_exists(self): """Verify that some content exists.""" self.assertIsInstance(self.document.readable, unicode) text = "S automatizovaným testováním kódu (a ve zbytku článku budu mít na mysli právě to) jsem se setkal v několika firmách." self.assertIn(text, self.document.readable) text = "Ke čtení naleznete mnoho různých materiálů, od teoretických po praktické ukázky." 
self.assertIn(text, self.document.readable) def test_content_does_not_exist(self): """Verify we cleaned out some content that shouldn't exist.""" self.assertNotIn("Pokud vás problematika zajímá, využijte možnosti navštívit školení", self.document.readable) breadability-0.1.20/tests/test_articles/test_antipope_org/0000775000175000017500000000000012322641424025401 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_antipope_org/__init__.py0000664000175000017500000000000012271036170027477 0ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_antipope_org/test_article.py0000664000175000017500000000233712320272411030434 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals import os from breadability.readable import Article from ...compat import unittest class TestAntipopeBlog(unittest.TestCase): """Test the scoring and parsing of the Blog Post""" def setUp(self): """Load up the article for us""" article_path = os.path.join(os.path.dirname(__file__), 'article.html') self.article = open(article_path).read() def tearDown(self): """Drop the article""" self.article = None def test_parses(self): """Verify we can parse the document.""" doc = Article(self.article) self.assertTrue('id="readabilityBody"' in doc.readable) def test_comments_cleaned(self): """The div with the comments should be removed.""" doc = Article(self.article) self.assertTrue('class="comments"' not in doc.readable) def test_beta_removed(self): """The id=beta element should be removed It's link heavy and causing a lot of garbage content. This should be removed. """ doc = Article(self.article) self.assertTrue('id="beta"' not in doc.readable) breadability-0.1.20/tests/test_articles/test_sweetshark/0000775000175000017500000000000012322641424025073 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_sweetshark/__init__.py0000664000175000017500000000000012271036170027171 0ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_sweetshark/test_article.py0000664000175000017500000000210112320272411030113 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from os.path import join, dirname from breadability.readable import Article from ...compat import unittest class TestSweetsharkBlog(unittest.TestCase): """ Test the scoring and parsing of the article from URL below: http://sweetshark.livejournal.com/11564.html """ def setUp(self): """Load up the article for us""" article_path = join(dirname(__file__), "article.html") with open(article_path, "rb") as file: self.document = Article(file.read(), "http://sweetshark.livejournal.com/11564.html") def tearDown(self): """Drop the article""" self.document = None def test_parses(self): """Verify we can parse the document.""" self.assertIn('id="readabilityBody"', self.document.readable) def test_content_after_video(self): """The div with the comments should be removed.""" self.assertIn('Stay hungry, Stay foolish', self.document.readable) breadability-0.1.20/tests/test_articles/test_businessinsider-com/0000775000175000017500000000000012322641424026700 5ustar rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_businessinsider-com/__init__.py0000664000175000017500000000000012271036170030776 0ustar 
rhardingrharding00000000000000breadability-0.1.20/tests/test_articles/test_businessinsider-com/test_article.py0000664000175000017500000000211512320272411031725 0ustar rhardingrharding00000000000000import os try: # Python < 2.7 import unittest2 as unittest except ImportError: import unittest from breadability.readable import Article class TestBusinessInsiderArticle(unittest.TestCase): """Test the scoring and parsing of the Blog Post""" def setUp(self): """Load up the article for us""" article_path = os.path.join(os.path.dirname(__file__), 'article.html') self.article = open(article_path).read() def tearDown(self): """Drop the article""" self.article = None def test_parses(self): """Verify we can parse the document.""" doc = Article(self.article) self.assertTrue('id="readabilityBody"' in doc.readable) def test_images_preserved(self): """The div with the comments should be removed.""" doc = Article(self.article) self.assertTrue('bharath-kumar-a-co-founder-at-pugmarksme-suggests-working-on-a-sunday-late-night.jpg' in doc.readable) self.assertTrue('bryan-guido-hassin-a-university-professor-and-startup-junkie-uses-airplane-days.jpg' in doc.readable) breadability-0.1.20/tests/test_orig_document.py0000664000175000017500000000713312320273343023261 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from collections import defaultdict from breadability._compat import ( to_unicode, to_bytes, unicode, ) from breadability.document import ( convert_breaks_to_paragraphs, decode_html, OriginalDocument, ) from .compat import unittest from .utils import load_snippet class TestOriginalDocument(unittest.TestCase): """Verify we can process html into a document to work off of.""" def test_convert_br_tags_to_paragraphs(self): returned = convert_breaks_to_paragraphs( ("
    HI

    How are you?

    \t \n
    " "Fine\n I guess
    ")) self.assertEqual( returned, "
    HI

    How are you?

    Fine\n I guess

    ") def test_convert_hr_tags_to_paragraphs(self): returned = convert_breaks_to_paragraphs( "
    HI

    How are you?
    \t \n
    Fine\n I guess
    ") self.assertEqual( returned, "
    HI

    How are you?

    Fine\n I guess

    ") def test_readin_min_document(self): """Verify we can read in a min html document""" doc = OriginalDocument(load_snippet('document_min.html')) self.assertTrue(to_unicode(doc).startswith('')) self.assertEqual(doc.title, 'Min Document Title') def test_readin_with_base_url(self): """Passing a url should update links to be absolute links""" doc = OriginalDocument( load_snippet('document_absolute_url.html'), url="http://blog.mitechie.com/test.html") self.assertTrue(to_unicode(doc).startswith('')) # find the links on the page and make sure each one starts with out # base url we told it to use. links = doc.links self.assertEqual(len(links), 3) # we should have two links that start with our blog url # and one link that starts with amazon link_counts = defaultdict(int) for link in links: if link.get('href').startswith('http://blog.mitechie.com'): link_counts['blog'] += 1 else: link_counts['other'] += 1 self.assertEqual(link_counts['blog'], 2) self.assertEqual(link_counts['other'], 1) def test_no_br_allowed(self): """We convert all
<br/> tags to <p> tags"""
        doc = OriginalDocument(load_snippet('document_min.html'))
        self.assertIsNone(doc.dom.find('.//br'))

    def test_empty_title(self):
        """An empty <title> element gives us an empty title string."""
        document = OriginalDocument(
            "<html><head><title></title></head><body></body></html>")
        self.assertEqual(document.title, "")

    def test_title_only_with_tags(self):
        """A <title> element containing only markup gives us an empty title."""
        document = OriginalDocument(
            "<html><head><title><em></em></title></head><body></body></html>")
        self.assertEqual(document.title, "")

    def test_no_title(self):
        """A document that has no <title>
    tags""" document = OriginalDocument("") self.assertEqual(document.title, "") def test_encoding(self): text = "ľščťžýáíéäúňôůě".encode("iso-8859-2") html = decode_html(text) self.assertEqual(type(html), unicode) def test_encoding_short(self): text = to_bytes("ľščťžýáíé") html = decode_html(text) self.assertEqual(type(html), unicode) self.assertEqual(html, "ľščťžýáíé") breadability-0.1.20/tests/test_readable.py0000664000175000017500000003201512271036170022157 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from lxml.etree import tounicode from lxml.html import document_fromstring from lxml.html import fragment_fromstring from breadability._compat import to_unicode from breadability.readable import ( Article, get_class_weight, get_link_density, is_bad_link, leaf_div_elements_into_paragraphs, score_candidates, ) from breadability.scoring import ScoredNode from .compat import unittest from .utils import load_snippet, load_article class TestReadableDocument(unittest.TestCase): """Verify we can process html into a document to work off of.""" def test_load_doc(self): """We get back an element tree from our original doc""" doc = Article(load_snippet('document_min.html')) # We get back the document as a div tag currently by default. self.assertEqual(doc.readable_dom.tag, 'div') def test_title_loads(self): """Verify we can fetch the title of the parsed article""" doc = Article(load_snippet('document_min.html')) self.assertEqual( doc._original_document.title, 'Min Document Title' ) def test_doc_no_scripts_styles(self): """Step #1 remove all scripts from the document""" doc = Article(load_snippet('document_scripts.html')) readable = doc.readable_dom self.assertEqual(readable.findall(".//script"), []) self.assertEqual(readable.findall(".//style"), []) self.assertEqual(readable.findall(".//link"), []) def test_find_body_exists(self): """If the document has a body, we store that as the readable html No sense processing anything other than the body content. """ doc = Article(load_snippet('document_min.html')) self.assertEqual(doc.readable_dom.tag, 'div') self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody') def test_body_doesnt_exist(self): """If we can't find a body, then we create one. We build our doc around the rest of the html we parsed. """ doc = Article(load_snippet('document_no_body.html')) self.assertEqual(doc.readable_dom.tag, 'div') self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody') def test_bare_content(self): """If the document is just pure content, no html tags we should be ok We build our doc around the rest of the html we parsed. 
""" doc = Article(load_snippet('document_only_content.html')) self.assertEqual(doc.readable_dom.tag, 'div') self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody') def test_no_content(self): """Without content we supply an empty unparsed doc.""" doc = Article('') self.assertEqual(doc.readable_dom.tag, 'div') self.assertEqual(doc.readable_dom.get('id'), 'readabilityBody') self.assertEqual(doc.readable_dom.get('class'), 'parsing-error') class TestCleaning(unittest.TestCase): """Test out our cleaning processing we do.""" def test_unlikely_hits(self): """Verify we wipe out things from our unlikely list.""" doc = Article(load_snippet('test_readable_unlikely.html')) readable = doc.readable_dom must_not_appear = [ 'comment', 'community', 'disqus', 'extra', 'foot', 'header', 'menu', 'remark', 'rss', 'shoutbox', 'sidebar', 'sponsor', 'ad-break', 'agegate', 'pagination' '', 'pager', 'popup', 'tweet', 'twitter', 'imgBlogpostPermalink'] want_to_appear = ['and', 'article', 'body', 'column', 'main', 'shadow'] for i in must_not_appear: # we cannot find any class or id with this value by_class = readable.find_class(i) for test in by_class: # if it's here it cannot have the must not class without the # want to appear class found = False for cls in test.get('class').split(): if cls in want_to_appear: found = True self.assertTrue(found) by_ids = readable.get_element_by_id(i, False) if by_ids is not False: found = False for ids in test.get('id').split(): if ids in want_to_appear: found = True self.assertTrue(found) def test_misused_divs_transform(self): """Verify we replace leaf node divs with p's They should have the same content, just be a p vs a div """ test_html = "
<html><body><div>simple</div></body></html>"
        test_doc = document_fromstring(test_html)
        self.assertEqual(
            tounicode(
                leaf_div_elements_into_paragraphs(test_doc)),
            to_unicode("<html><body><p>simple</p></body></html>")
        )

        test_html2 = ('<html><body><div>simple<a href="">link</a>'
                      '</div></body></html>')
        test_doc2 = document_fromstring(test_html2)
        self.assertEqual(
            tounicode(
                leaf_div_elements_into_paragraphs(test_doc2)),
            to_unicode(
                '<html><body><p>simple<a href="">link</a></p></body></html>
    ') ) def test_dont_transform_div_with_div(self): """Verify that only child
<div> element is replaced by <p>."""
        dom = document_fromstring(
            "<html><body><div>text<div>child</div>"
            "aftertext</div></body></html>"
        )

        self.assertEqual(
            tounicode(
                leaf_div_elements_into_paragraphs(dom)),
            to_unicode(
                "<html><body><div>text<p>child</p>"
                "aftertext</div></body></html>
    " ) ) def test_bad_links(self): """Some links should just not belong.""" bad_links = [ ' ', 'permalink', 'permalink' ] for l in bad_links: link = fragment_fromstring(l) self.assertTrue(is_bad_link(link)) class TestCandidateNodes(unittest.TestCase): """Candidate nodes are scoring containers we use.""" def test_candidate_scores(self): """We should be getting back objects with some scores.""" fives = ['
<div/>']
        threes = ['<pre/>', '<td/>', '<blockquote/>']
        neg_threes = ['<ol/>', '<ul/>']
        neg_fives = ['<h1/>', '<h2/>', '<h3/>', '<h4/>
      '] for n in fives: doc = fragment_fromstring(n) self.assertEqual(ScoredNode(doc).content_score, 5) for n in threes: doc = fragment_fromstring(n) self.assertEqual(ScoredNode(doc).content_score, 3) for n in neg_threes: doc = fragment_fromstring(n) self.assertEqual(ScoredNode(doc).content_score, -3) for n in neg_fives: doc = fragment_fromstring(n) self.assertEqual(ScoredNode(doc).content_score, -5) def test_article_enables_candidate_access(self): """Candidates are accessible after document processing.""" doc = Article(load_article('ars.001.html')) self.assertTrue(hasattr(doc, 'candidates')) class TestClassWeights(unittest.TestCase): """Certain ids and classes get us bonus points.""" def test_positive_class(self): """Some classes get us bonus points.""" node = fragment_fromstring('

      ') self.assertEqual(get_class_weight(node), 25) def test_positive_ids(self): """Some ids get us bonus points.""" node = fragment_fromstring('

      ') self.assertEqual(get_class_weight(node), 25) def test_negative_class(self): """Some classes get us negative points.""" node = fragment_fromstring('

      ') self.assertEqual(get_class_weight(node), -25) def test_negative_ids(self): """Some ids get us negative points.""" node = fragment_fromstring('

      ') self.assertEqual(get_class_weight(node), -25) class TestScoringNodes(unittest.TestCase): """We take out list of potential nodes and score them up.""" def test_we_get_candidates(self): """Processing candidates should get us a list of nodes to try out.""" doc = document_fromstring(load_article("ars.001.html")) test_nodes = tuple(doc.iter("p", "td", "pre")) candidates = score_candidates(test_nodes) # this might change as we tweak our algorithm, but if it does, # it signifies we need to look at what we changed. self.assertEqual(len(candidates.keys()), 37) # one of these should have a decent score scores = sorted(c.content_score for c in candidates.values()) self.assertTrue(scores[-1] > 100) def test_bonus_score_per_100_chars_in_p(self): """Nodes get 1 point per 100 characters up to max. 3 points.""" def build_candidates(length): html = "
<p>%s</p>
      " % ("c" * length) node = fragment_fromstring(html) return [node] test_nodes = build_candidates(50) candidates = score_candidates(test_nodes) pscore_50 = max(c.content_score for c in candidates.values()) test_nodes = build_candidates(100) candidates = score_candidates(test_nodes) pscore_100 = max(c.content_score for c in candidates.values()) test_nodes = build_candidates(300) candidates = score_candidates(test_nodes) pscore_300 = max(c.content_score for c in candidates.values()) test_nodes = build_candidates(400) candidates = score_candidates(test_nodes) pscore_400 = max(c.content_score for c in candidates.values()) self.assertAlmostEqual(pscore_50 + 0.5, pscore_100) self.assertAlmostEqual(pscore_100 + 2.0, pscore_300) self.assertAlmostEqual(pscore_300, pscore_400) class TestLinkDensityScoring(unittest.TestCase): """Link density will adjust out candidate scoresself.""" def test_link_density(self): """Test that we get a link density""" doc = document_fromstring(load_article('ars.001.html')) for node in doc.iter('p', 'td', 'pre'): density = get_link_density(node) # the density must be between 0, 1 self.assertTrue(density >= 0.0 and density <= 1.0) class TestSiblings(unittest.TestCase): """Siblings will be included if their content is related.""" @unittest.skip("Not implemented yet.") def test_bad_siblings_not_counted(self): raise NotImplementedError() @unittest.skip("Not implemented yet.") def test_good_siblings_counted(self): raise NotImplementedError() class TestMainText(unittest.TestCase): def test_empty(self): article = Article("") annotated_text = article.main_text self.assertEqual(annotated_text, []) def test_no_annotations(self): article = Article("
<div><p>This is text with no annotations</p></div>
      ") annotated_text = article.main_text self.assertEqual(annotated_text, [(("This is text with no annotations", None),)]) def test_one_annotation(self): article = Article("
<div><p>This is text\r\twith <del>no</del> annotations</p></div>
      ") annotated_text = article.main_text expected = [( ("This is text\nwith", None), ("no", ("del",)), ("annotations", None), )] self.assertEqual(annotated_text, expected) def test_simple_snippet(self): snippet = Article(load_snippet("annotated_1.html")) annotated_text = snippet.main_text expected = [ ( ("Paragraph is more", None), ("better", ("em",)), (".\nThis text is very", None), ("pretty", ("strong",)), ("'cause she's girl.", None), ), ( ("This is not", None), ("crap", ("big",)), ("so", None), ("readability", ("dfn",)), ("me :)", None), ) ] self.assertEqual(annotated_text, expected) breadability-0.1.20/tests/utils.py0000664000175000017500000000120712271036170020520 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import absolute_import from __future__ import division, print_function, unicode_literals from os.path import abspath, dirname, join TEST_DIR = abspath(dirname(__file__)) def load_snippet(file_name): """Helper to fetch in the content of a test snippet.""" file_path = join(TEST_DIR, "data/snippets", file_name) with open(file_path, "rb") as file: return file.read() def load_article(file_name): """Helper to fetch in the content of a test article.""" file_path = join(TEST_DIR, "data/articles", file_name) with open(file_path, "rb") as file: return file.read() breadability-0.1.20/tests/test_annotated_text.py0000664000175000017500000001405512271036170023445 0ustar rhardingrharding00000000000000# -*- coding: utf8 -*- from __future__ import ( absolute_import, division, print_function, unicode_literals ) from lxml.html import fragment_fromstring, document_fromstring from breadability.readable import Article from breadability.annotated_text import AnnotatedTextHandler from .compat import unittest from .utils import load_snippet, load_article class TestAnnotatedText(unittest.TestCase): def test_simple_document(self): dom = fragment_fromstring("
<div><p>This is\n\tsimple\ttext.</p></div>
      ") annotated_text = AnnotatedTextHandler.parse(dom) expected = [ ( ("This is\nsimple text.", None), ), ] self.assertEqual(annotated_text, expected) def test_empty_paragraph(self): dom = fragment_fromstring("
<div><p>Paragraph</p><p>\t \n</p></div>
      ") annotated_text = AnnotatedTextHandler.parse(dom) expected = [ ( ("Paragraph", None), ), ] self.assertEqual(annotated_text, expected) def test_multiple_paragraphs(self): dom = fragment_fromstring("
<div><p>1 first</p><p>2\tsecond</p><p>3\rthird</p></div>
      ") annotated_text = AnnotatedTextHandler.parse(dom) expected = [ ( ("1 first", None), ), ( ("2 second", None), ), ( ("3\nthird", None), ), ] self.assertEqual(annotated_text, expected) def test_single_annotation(self): dom = fragment_fromstring("
<div><p>text <em>emphasis</em></p><p>last</p></div>
      ") annotated_text = AnnotatedTextHandler.parse(dom) expected = [ ( ("text", None), ("emphasis", ("em",)), ), ( ("last", None), ), ] self.assertEqual(annotated_text, expected) def test_recursive_annotation(self): dom = fragment_fromstring("
<div><p>text <em><i>emphasis</i></em></p><p>last</p></div>
      ") annotated_text = AnnotatedTextHandler.parse(dom) expected = [ ( ("text", None), ("emphasis", ("em", "i")), ), ( ("last", None), ), ] self.assertEqual(annotated_text, expected) def test_annotations_without_explicit_paragraph(self): dom = fragment_fromstring("
<div>text <strong>emphasis</strong>\t<b>hmm</b></div>
      ") annotated_text = AnnotatedTextHandler.parse(dom) expected = [ ( ("text", None), ("emphasis", ("strong",)), ("hmm", ("b",)), ), ] self.assertEqual(annotated_text, expected) def test_process_paragraph_with_chunked_text(self): handler = AnnotatedTextHandler() paragraph = handler._process_paragraph([ (" 1", ("b", "del")), (" 2", ("b", "del")), (" 3", None), (" 4", None), (" 5", None), (" 6", ("em",)), ]) expected = ( ("1 2", ("b", "del")), ("3 4 5", None), ("6", ("em",)), ) self.assertEqual(paragraph, expected) def test_include_heading(self): dom = document_fromstring(load_snippet("h1_and_2_paragraphs.html")) annotated_text = AnnotatedTextHandler.parse(dom.find("body")) expected = [ ( ('Nadpis H1, ktorý chce byť prvý s textom ale predbehol ho "title"', ("h1",)), ("Toto je prvý odstavec a to je fajn.", None), ), ( ("Tento text je tu aby vyplnil prázdne miesto v srdci súboru.\nAj súbory majú predsa city.", None), ), ] self.assertSequenceEqual(annotated_text, expected) def test_real_article(self): article = Article(load_article("zdrojak_automaticke_zabezpeceni.html")) annotated_text = article.main_text expected = [ ( ("Automatické zabezpečení", ("h1",)), ("Úroveň zabezpečení aplikace bych rozdělil do tří úrovní:", None), ), ( ("Aplikace zabezpečená není, neošetřuje uživatelské vstupy ani své výstupy.", ("li", "ol")), ("Aplikace se o zabezpečení snaží, ale takovým způsobem, že na ně lze zapomenout.", ("li", "ol")), ("Aplikace se o zabezpečení stará sama, prakticky se nedá udělat chyba.", ("li", "ol")), ), ( ("Jak se tyto úrovně projevují v jednotlivých oblastech?", None), ), ( ("XSS", ("a", "h2")), ("Druhou úroveň představuje ruční ošetřování pomocí", None), ("htmlspecialchars", ("a", "kbd")), (". Třetí úroveň zdánlivě reprezentuje automatické ošetřování v šablonách, např. v", None), ("Nette Latte", ("a", "strong")), (". Proč píšu zdánlivě? Problém je v tom, že ošetření se dá obvykle snadno zakázat, např. v Latte pomocí", None), ("{!$var}", ("code",)), (". Viděl jsem šablony plné vykřičníků i na místech, kde být neměly. Autor to vysvětlil tak, že psaní", None), ("{$var}", ("code",)), ("někde způsobovalo problémy, které po přidání vykřičníku zmizely, tak je začal psát všude.", None), ), ( ("process($content_texy);\n$content = Html::el()->setHtml($safeHtml);\n// v šabloně pak můžeme použít {$content}\n?>", ("pre", )), ), ( ("Ideální by bylo, když by už samotná metoda", None), ("process()", ("code",)), ("vracela instanci", None), ("Html", ("code",)), (".", None), ), ] self.assertSequenceEqual(annotated_text, expected)