breadability-0.1.20/ 0000775 0001750 0001750 00000000000 12322641424 015645 5 ustar rharding rharding 0000000 0000000 breadability-0.1.20/PKG-INFO 0000664 0001750 0001750 00000025670 12322641424 016754 0 ustar rharding rharding 0000000 0000000 Metadata-Version: 1.1
Name: breadability
Version: 0.1.20
Summary: Port of Readability HTML parser in Python
Home-page: https://github.com/bookieio/breadability
Author: Rick Harding
Author-email: rharding@mitechie.com
License: BSD
Description: breadability - another readability Python (v2.6-v3.3) port
===========================================================
.. image:: https://api.travis-ci.org/bookieio/breadability.png?branch=master
   :target: https://travis-ci.org/bookieio/breadability.py
I've tried to work with the various forks of some ancient codebase that ported
`readability`_ to Python. The lack of tests, unused regexes, and commented-out
sections of code in other Python ports just drove me nuts.
I put forth an effort to bring in several of the better forks into one
code base, but they've diverged so much that I just can't work with it.
So what's any sane person to do? Re-port it with my own repo, add some tests,
infrastructure, and try to make this port better. OSS FTW (and yea, NIH FML,
but oh well I did try).
This is a pretty straight port of the JS here:
- http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82
- http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
Alternatives
------------
- https://github.com/codelucas/newspaper
- https://github.com/grangier/python-goose
- https://github.com/aidanf/BTE
- http://www.unixuser.org/~euske/python/webstemmer/#extract
- https://github.com/al3xandru/readability.py
- https://github.com/rcarmo/soup-strainer
- https://github.com/bcampbell/decruft
- https://github.com/gfxmonk/python-readability
- https://github.com/srid/readability
- https://github.com/dcramer/decruft
- https://github.com/reorx/readability
- https://github.com/mote/python-readability
- https://github.com/predatell/python-readability-lxml
- https://github.com/Harshavardhana/boilerpipy
- https://github.com/raptium/hitomi
- https://github.com/kingwkb/readability
Installation
------------
This does depend on lxml, so you'll need some C headers installed in order
for pip to compile it.
.. code-block:: bash
    $ [sudo] apt-get install libxml2-dev libxslt-dev
    $ [sudo] pip install git+git://github.com/bookieio/breadability.git
Tests
-----
.. code-block:: bash
    $ nosetests-2.6 tests && nosetests-3.2 tests && nosetests-2.7 tests && nosetests-3.3 tests
Usage
-----
Command line
~~~~~~~~~~~~
.. code-block:: bash
    $ breadability http://wiki.python.org/moin/BeginnersGuide
Options
```````
- **b** will write out the parsed content to a temp file and open it in a
  browser for viewing.
- **d** will write out debug scoring statements to help track why a node was
  chosen as the document and why some nodes were removed from the final
  product.
- **f** will override the default behaviour of getting an html fragment
  (<div>) and give you back a full document instead.
Changelog for breadability
==========================
0.1.20 (April 13th 2014)
------------------------
- Don't include tests in sdist builds.
0.1.19 (April 13th 2014)
------------------------
- Replace charade with chardet for easier packaging.
0.1.18 (April 6th 2014)
-----------------------
- Improved decoding of the page into Unicode.
0.1.17 (Jan 22nd 2014)
----------------------
- More log quieting down to INFO vs WARN.
0.1.16 (Jan 22nd 2014)
----------------------
- Clean up logging output at warning when it's not a true warning.
0.1.15 (Nov 29th 2013)
----------------------
- Merge changes from 0.1.14 of breadability with the fork
  https://github.com/miso-belica/readability.py and tweak to return to the
  name breadability.
- Fork: Added property ``Article.main_text`` for getting text annotated with
  semantic HTML tags (<em>, <strong>, ...).
- Fork: Join a node with a single child of the same type. From
  ``<div><div>...</div></div>`` is created ``<div>...</div>``, if it contains
  <p> elements.
- Fork: Renamed test generation helper 'readability_newtest' ->
  'readability_test'.
- Fork: Renamed package to readability. (Renamed back.)
- Fork: Added support for Python >= 3.2.
- Fork: Py3k compatible package 'charade' is used instead of 'chardet'.
0.1.14 (Nov 7th 2013)
---------------------
- Update sibling append to only happen when sibling doesn't already exist.
0.1.13 (Aug 31st 2013)
----------------------
- Give images in content body a better chance of survival
- Add tests
0.1.12 (July 28th 2013)
-----------------------
- Add a user agent to requests.
0.1.11 (Dec 12th 2012)
----------------------
- Add argparse to the install requires for python < 2.7
0.1.10 (Sept 13th 2012)
-----------------------
- Updated scoring with a bonus for "," characters and a penalty for '"' characters.
0.1.9 (Aug 27th 2012)
---------------------
- In case of an issue dealing with candidates we need to act like we didn't
  find any candidates for the article content. #10
0.1.8 (Aug 27th 2012)
---------------------
- Add code/tests for an empty document.
- Fixes #9 to handle xml parsing issues.
0.1.7 (July 21st 2012)
----------------------
- Change the encode 'replace' kwarg into a normal arg for older Python
  versions.
0.1.6 (June 17th 2012)
----------------------
- Fix the link removal, add tests and a place to process other bad links.
0.1.5 (June 16th 2012)
----------------------
- Start to look at removing bad links from content in the conditional cleaning
state. This was really used for the scripting.com site's garbage.
0.1.4 (June 16th 2012)
----------------------
- Add a test generation helper readability_newtest script.
- Add tests and fixes for the scripting news parse failure.
0.1.3 (June 15th 2012)
----------------------
- Add actual testing of full articles for regression tests.
- Update parser to properly clean after winner doc node is chosen.
0.1.2 (May 28th 2012)
---------------------
- Bugfix: #4 issue with the logic of the 100-char bonus points in scoring
- Garden with PyLint/PEP8
- Add a bunch of tests to readable/scoring code.
0.1.1 (May 11th 2012)
---------------------
- Fix bugs in scoring to help in getting right content
- Add concept of -d which shows scoring/decisions on nodes
- Update command line client to be able to pipe output to other tools
0.1.0 (May 6th 2012)
--------------------
- Initial release and upload to PyPI
Keywords: bookie,breadability,content,HTML,parsing,readability,readable
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 2
Classifier: Programming Language :: Python :: 2.6
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.2
Classifier: Programming Language :: Python :: 3.3
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Pre-processors
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Text Processing :: Markup :: HTML
breadability-0.1.20/setup.cfg 0000664 0001750 0001750 00000000236 12322641424 017467 0 ustar rharding rharding 0000000 0000000 [nosetests]
with-coverage = 1
cover-package = breadability
cover-erase = 1
[wheel]
universal = 1
[egg_info]
tag_build =
tag_date = 0
tag_svn_revision = 0
breadability-0.1.20/setup.py 0000664 0001750 0001750 00000005047 12322641326 017366 0 ustar rharding rharding 0000000 0000000 import sys
from os.path import (
abspath,
dirname,
join,
)
from setuptools import setup
VERSION = "0.1.20"
VERSION_SUFFIX = "%d.%d" % sys.version_info[:2]
CURRENT_DIRECTORY = abspath(dirname(__file__))
with open(join(CURRENT_DIRECTORY, "README.rst")) as readme:
with open(join(CURRENT_DIRECTORY, "CHANGELOG.rst")) as changelog:
long_description = "%s\n\n%s" % (readme.read(), changelog.read())
install_requires = [
"docopt>=0.6.1,<0.7",
"chardet",
"lxml>=2.0",
]
tests_require = [
"nose-selecttests",
"coverage",
"pylint",
"nose",
"pep8",
]
if sys.version_info < (2, 7):
install_requires.append("unittest2")
console_script_targets = [
"breadability = breadability.scripts.client:main",
"breadability-{0} = breadability.scripts.client:main",
"breadability_test = breadability.scripts.test_helper:main",
"breadability_test-{0} = breadability.scripts.test_helper:main",
]
console_script_targets = [
target.format(VERSION_SUFFIX) for target in console_script_targets
]
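# Example: with VERSION_SUFFIX == "2.7" the formatting above yields
#   breadability = breadability.scripts.client:main
#   breadability-2.7 = breadability.scripts.client:main
#   breadability_test = breadability.scripts.test_helper:main
#   breadability_test-2.7 = breadability.scripts.test_helper:main
# which matches the entry_points.txt shipped in breadability.egg-info below.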
setup(
name="breadability",
version=VERSION,
description="Port of Readability HTML parser in Python",
long_description=long_description,
keywords=[
"bookie",
"breadability",
"content",
"HTML",
"parsing",
"readability",
"readable",
],
author="Rick Harding",
author_email="rharding@mitechie.com",
url="https://github.com/bookieio/breadability",
license="BSD",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"License :: OSI Approved :: BSD License",
"Operating System :: OS Independent",
"Programming Language :: Python",
"Programming Language :: Python :: 2",
"Programming Language :: Python :: 2.6",
"Programming Language :: Python :: 2.7",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.2",
"Programming Language :: Python :: 3.3",
"Programming Language :: Python :: Implementation :: CPython",
"Topic :: Internet :: WWW/HTTP",
"Topic :: Software Development :: Pre-processors",
"Topic :: Text Processing :: Filters",
"Topic :: Text Processing :: Markup :: HTML",
],
packages=['breadability'],
include_package_data=True,
zip_safe=False,
install_requires=install_requires,
tests_require=tests_require,
test_suite="nose.collector",
entry_points={
"console_scripts": console_script_targets,
}
)
breadability-0.1.20/CHANGELOG.rst 0000664 0001750 0001750 00000006537 12322641374 017705 0 ustar rharding rharding 0000000 0000000 .. :changelog:
Changelog for breadability
==========================
0.1.20 (April 13th 2014)
-------------------------
- Don't include tests in sdist builds.
0.1.19 (April 13th 2014)
--------------------------
- Replace charade with chardet for easier packaging.
0.1.18 (April 6th 2014)
------------------------
- Improved decoding of the page into Unicode.
0.1.17 (Jan 22nd 2014)
----------------------
- More log quieting down to INFO vs WARN
0.1.16 (Jan 22nd 2014)
----------------------
- Clean up logging output at warning when it's not a true warning
0.1.15 (Nov 29th 2013)
----------------------
- Merge changes from 0.1.14 of breadability with the fork
  https://github.com/miso-belica/readability.py and tweak to return to the
  name breadability.
- Fork: Added property ``Article.main_text`` for getting text annotated with
  semantic HTML tags (<em>, <strong>, ...).
- Fork: Join a node with a single child of the same type. From
  ``<div><div>...</div></div>`` is created ``<div>...</div>``, if it contains
  <p> elements.
- Fork: Renamed test generation helper 'readability_newtest' -> 'readability_test'.
- Fork: Renamed package to readability. (Renamed back)
- Fork: Added support for Python >= 3.2.
- Fork: Py3k compatible package 'charade' is used instead of 'chardet'.
0.1.14 (Nov 7th 2013)
---------------------
- Update sibling append to only happen when sibling doesn't already exist.
0.1.13 (Aug 31st 2013)
----------------------
- Give images in content body a better chance of survival
- Add tests
0.1.12 (July 28th 2013)
-----------------------
- Add a user agent to requests.
0.1.11 (Dec 12th 2012)
----------------------
- Add argparse to the install requires for python < 2.7
0.1.10 (Sept 13th 2012)
-----------------------
- Updated scoring with a bonus for "," characters and a penalty for '"' characters.
0.1.9 (Aug 27th 2012)
---------------------
- In case of an issue dealing with candidates we need to act like we didn't
  find any candidates for the article content. #10
0.1.8 (Aug 27th 2012)
---------------------
- Add code/tests for an empty document.
- Fixes #9 to handle xml parsing issues.
0.1.7 (July 21st 2012)
----------------------
- Change the encode 'replace' kwarg into a normal arg for older Python
  versions.
0.1.6 (June 17th 2012)
----------------------
- Fix the link removal, add tests and a place to process other bad links.
0.1.5 (June 16th 2012)
----------------------
- Start to look at removing bad links from content in the conditional cleaning
state. This was really used for the scripting.com site's garbage.
0.1.4 (June 16th 2012)
----------------------
- Add a test generation helper readability_newtest script.
- Add tests and fixes for the scripting news parse failure.
0.1.3 (June 15th 2012)
----------------------
- Add actual testing of full articles for regression tests.
- Update parser to properly clean after winner doc node is chosen.
0.1.2 (May 28th 2012)
---------------------
- Bugfix: #4 issue with the logic of the 100-char bonus points in scoring
- Garden with PyLint/PEP8
- Add a bunch of tests to readable/scoring code.
0.1.1 (May 11th 2012)
---------------------
- Fix bugs in scoring to help in getting right content
- Add concept of -d which shows scoring/decisions on nodes
- Update command line client to be able to pipe output to other tools
0.1.0 (May 6th 2012)
--------------------
- Initial release and upload to PyPI
breadability-0.1.20/breadability/ 0000775 0001750 0001750 00000000000 12322641424 020300 5 ustar rharding rharding 0000000 0000000 breadability-0.1.20/breadability/scoring.py 0000664 0001750 0001750 00000020554 12271036170 022323 0 ustar rharding rharding 0000000 0000000 # -*- coding: utf8 -*-
"""Handle dealing with scoring nodes and content for our parsing."""
from __future__ import absolute_import
from __future__ import division, print_function
import re
import logging
from hashlib import md5
from lxml.etree import tostring
from ._compat import to_bytes
from .utils import normalize_whitespace
# A series of regular expressions we check against node attributes to help
# determine whether a node is a potential candidate or not.
CLS_UNLIKELY = re.compile(
"combx|comment|community|disqus|extra|foot|header|menu|remark|rss|"
"shoutbox|sidebar|sponsor|ad-break|agegate|pagination|pager|perma|popup|"
"tweet|twitter|social|breadcrumb",
re.IGNORECASE
)
CLS_MAYBE = re.compile(
"and|article|body|column|main|shadow|entry",
re.IGNORECASE
)
CLS_WEIGHT_POSITIVE = re.compile(
"article|body|content|entry|main|page|pagination|post|text|blog|story",
re.IGNORECASE
)
CLS_WEIGHT_NEGATIVE = re.compile(
"combx|comment|com-|contact|foot|footer|footnote|head|masthead|media|meta|"
"outbrain|promo|related|scroll|shoutbox|sidebar|sponsor|shopping|tags|"
"tool|widget",
re.IGNORECASE
)
logger = logging.getLogger("breadability")
def check_node_attributes(pattern, node, *attributes):
"""
    Searches the given attributes of the node against the pattern and
    returns True as soon as any of them matches.
"""
for attribute_name in attributes:
attribute = node.get(attribute_name)
if attribute is not None and pattern.search(attribute):
return True
return False
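# Usage sketch (CLS_UNLIKELY is defined above; "comment" matches):
#
#   >>> from lxml.html import fragment_fromstring
#   >>> node = fragment_fromstring('<div class="disqus comments"></div>')
#   >>> check_node_attributes(CLS_UNLIKELY, node, "class", "id")
#   True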
def generate_hash_id(node):
"""
Generates a hash_id for the node in question.
:param node: lxml etree node
"""
try:
content = tostring(node)
except Exception:
logger.exception("Generating of hash failed")
content = to_bytes(repr(node))
hash_id = md5(content).hexdigest()
return hash_id[:8]
def get_link_density(node, node_text=None):
"""
Computes the ratio for text in given node and text in links
contained in the node. It is computed from number of
characters in the texts.
:parameter Element node:
        HTML element in which link density is computed.
:parameter string node_text:
Text content of given node if it was obtained before.
:returns float:
Returns value of computed 0 <= density <= 1, where 0 means
no links and 1 means that node contains only links.
"""
if node_text is None:
node_text = node.text_content()
node_text = normalize_whitespace(node_text.strip())
text_length = len(node_text)
if text_length == 0:
return 0.0
links_length = sum(map(_get_normalized_text_length, node.findall(".//a")))
# Give 50 bonus chars worth of length for each img.
# Tweaking this 50 down a notch should help if we hit false positives.
img_bonuses = 50 * len(node.findall(".//img"))
links_length = max(0, links_length - img_bonuses)
return links_length / text_length
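# Worked example: for <div>some text <a href="#">a link</a></div> the node
# text is "some text a link" (16 chars) and the link text is "a link"
# (6 chars), so the density is 6 / 16 = 0.375. Each <img> inside the node
# then deducts a 50-char bonus from the link length before dividing.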
def _get_normalized_text_length(node):
return len(normalize_whitespace(node.text_content().strip()))
def get_class_weight(node):
"""
    Computes the weight of an element according to its class/id attributes.
    Positive matches add 25 points each; negative matches subtract 25.
"""
weight = 0
if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "class"):
weight -= 25
if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "class"):
weight += 25
if check_node_attributes(CLS_WEIGHT_NEGATIVE, node, "id"):
weight -= 25
if check_node_attributes(CLS_WEIGHT_POSITIVE, node, "id"):
weight += 25
return weight
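# Example: <div class="content" id="footer"> gains 25 points for the
# positive "content" class, loses 25 for the negative "footer" id, and
# ends up with a net weight of 0.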
def is_unlikely_node(node):
"""
Short helper for checking unlikely status.
    If the class or id is in the unlikely list, and there's not also a
    class/id in the maybe list, then the node is a removal candidate.
"""
unlikely = check_node_attributes(CLS_UNLIKELY, node, "class", "id")
maybe = check_node_attributes(CLS_MAYBE, node, "class", "id")
return bool(unlikely and not maybe and node.tag != "body")
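# Example: <div class="sidebar"> is unlikely (it matches CLS_UNLIKELY and
# nothing in CLS_MAYBE), but <div class="sidebar main"> survives because
# "main" is in the maybe list.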
def score_candidates(nodes):
"""Given a list of potential nodes, find some initial scores to start"""
    MIN_HIT_LENGTH = 25
candidates = {}
for node in nodes:
logger.debug("* Scoring candidate %s %r", node.tag, node.attrib)
# if the node has no parent it knows of then it ends up creating a
# body & html tag to parent the html fragment
parent = node.getparent()
if parent is None:
logger.debug("Skipping candidate - parent node is 'None'.")
continue
grand = parent.getparent()
if grand is None:
logger.debug("Skipping candidate - grand parent node is 'None'.")
continue
        # if the paragraph is < `MIN_HIT_LENGTH` characters, don't even count it
        inner_text = node.text_content().strip()
        if len(inner_text) < MIN_HIT_LENGTH:
            logger.debug(
                "Skipping candidate - inner text < %d characters.",
                MIN_HIT_LENGTH)
continue
# initialize readability data for the parent
# add parent node if it isn't in the candidate list
if parent not in candidates:
candidates[parent] = ScoredNode(parent)
if grand not in candidates:
candidates[grand] = ScoredNode(grand)
# add a point for the paragraph itself as a base
content_score = 1
if inner_text:
# add 0.25 points for any commas within this paragraph
commas_count = inner_text.count(",")
content_score += commas_count * 0.25
logger.debug("Bonus points for %d commas.", commas_count)
# subtract 0.5 points for each double quote within this paragraph
double_quotes_count = inner_text.count('"')
content_score += double_quotes_count * -0.5
logger.debug(
"Penalty points for %d double-quotes.", double_quotes_count)
# for every 100 characters in this paragraph, add another point
# up to 3 points
length_points = len(inner_text) / 100
content_score += min(length_points, 3.0)
logger.debug("Bonus points for length of text: %f", length_points)
# add the score to the parent
logger.debug(
"Bonus points for parent %s %r with score %f: %f",
parent.tag, parent.attrib, candidates[parent].content_score,
content_score)
candidates[parent].content_score += content_score
# the grand node gets half
logger.debug(
"Bonus points for grand %s %r with score %f: %f",
grand.tag, grand.attrib, candidates[grand].content_score,
content_score / 2.0)
candidates[grand].content_score += content_score / 2.0
if node not in candidates:
candidates[node] = ScoredNode(node)
candidates[node].content_score += content_score
for candidate in candidates.values():
adjustment = 1.0 - get_link_density(candidate.node)
candidate.content_score *= adjustment
logger.debug(
"Link density adjustment for %s %r: %f",
candidate.node.tag, candidate.node.attrib, adjustment)
return candidates
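# Worked example: a <p> holding 250 characters of text with 2 commas scores
# 1 (base) + 2 * 0.25 (commas) + min(250 / 100, 3.0) (length) = 4.0 points;
# its parent receives the full 4.0, its grandparent 2.0, and every candidate
# is then scaled by (1 - link density).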
class ScoredNode(object):
"""
    A scored node used to track possible article matches.
    We might have a bunch of these, so we use __slots__ to keep memory usage
    down.
"""
__slots__ = ('node', 'content_score')
def __init__(self, node):
"""Given node, set an initial score and weigh based on css and id"""
self.node = node
self.content_score = 0
if node.tag in ('div', 'article'):
self.content_score = 5
if node.tag in ('pre', 'td', 'blockquote'):
self.content_score = 3
if node.tag in ('address', 'ol', 'ul', 'dl', 'dd', 'dt', 'li', 'form'):
self.content_score = -3
if node.tag in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'th'):
self.content_score = -5
self.content_score += get_class_weight(node)
@property
def hash_id(self):
return generate_hash_id(self.node)
    def __repr__(self):
        if self.node is None:
            return "<NullScoredNode with score {0:0.1F}>".format(
                self.content_score)
        return "<ScoredNode {0} {1}, score={2:0.1F}>".format(
            self.node.tag,
            self.node.attrib,
            self.content_score,
        )
breadability-0.1.20/breadability/document.py
def convert_breaks_to_paragraphs(html):
    """
    Converts sequences of <br> (and <hr>) tags into <p> tags, since runs of
    break tags usually stand in for paragraph separators.
    """
    logger.debug("Converting multiple <br> & <hr> tags into <p>.")
    return BREAK_TAGS_PATTERN.sub(_replace_break_tags, html)
def _replace_break_tags(match):
    tags = match.group()
    if to_unicode("<hr") in tags:
        return to_unicode("<p>")
    elif tags.count(to_unicode("<br")) > 1:
        return to_unicode("<p>")
    else:
        return tags
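# Example (assuming BREAK_TAGS_PATTERN matches runs of <br>/<hr> tags, as
# the replacement logic implies): "one<br/><br/>two" becomes "one<p>two",
# while a single "one<br/>two" is returned unchanged.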
def build_document(html_content, base_href=None):
"""Requires that the `html_content` not be None"""
assert html_content is not None
if isinstance(html_content, unicode):
html_content = html_content.encode("utf8", "xmlcharrefreplace")
try:
document = document_fromstring(html_content, parser=UTF8_PARSER)
except XMLSyntaxError:
raise ValueError("Failed to parse document contents.")
if base_href:
document.make_links_absolute(base_href, resolve_base_href=True)
else:
document.resolve_base_href()
return document
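# Usage sketch:
#
#   >>> doc = build_document("<html><body><p>Hi!</p></body></html>")
#   >>> doc.tag
#   'html'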
@unicode_compatible
class OriginalDocument(object):
"""The original document to process."""
def __init__(self, html, url=None):
self._html = html
self._url = url
@property
def url(self):
"""Source URL of HTML document."""
return self._url
def __unicode__(self):
"""Renders the document as a string."""
return tounicode(self.dom)
@cached_property
def dom(self):
"""Parsed HTML document from the input."""
html = self._html
if not isinstance(html, unicode):
html = decode_html(html)
html = convert_breaks_to_paragraphs(html)
document = build_document(html, self._url)
return document
@cached_property
def links(self):
"""Links within the document."""
return self.dom.findall(".//a")
@cached_property
def title(self):
"""Title attribute of the parsed document."""
title_element = self.dom.find(".//title")
if title_element is None or title_element.text is None:
return ""
else:
return title_element.text.strip()
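# Usage sketch:
#
#   >>> doc = OriginalDocument("<html><head><title> Hi! </title></head><body></body></html>")
#   >>> doc.title
#   'Hi!'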
breadability-0.1.20/breadability/annotated_text.py 0000664 0001750 0001750 00000005272 12271036170 023700 0 ustar rharding rharding 0000000 0000000 # -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from itertools import groupby
from lxml.sax import saxify, ContentHandler
from .utils import is_blank, shrink_text
from ._compat import to_unicode
_SEMANTIC_TAGS = frozenset((
"a", "abbr", "acronym", "b", "big", "blink", "blockquote", "cite", "code",
"dd", "del", "dfn", "dir", "dl", "dt", "em", "h", "h1", "h2", "h3", "h4",
"h5", "h6", "i", "ins", "kbd", "li", "marquee", "menu", "ol", "pre", "q",
"s", "samp", "strike", "strong", "sub", "sup", "tt", "u", "ul", "var",
))
class AnnotatedTextHandler(ContentHandler):
"""A class for converting a HTML DOM into annotated text."""
@classmethod
def parse(cls, dom):
"""Converts DOM into paragraphs."""
handler = cls()
saxify(dom, handler)
return handler.content
def __init__(self):
self._content = []
self._paragraph = []
self._dom_path = []
@property
def content(self):
return self._content
def startElementNS(self, name, qname, attrs):
namespace, name = name
if name in _SEMANTIC_TAGS:
self._dom_path.append(to_unicode(name))
def endElementNS(self, name, qname):
namespace, name = name
if name == "p" and self._paragraph:
self._append_paragraph(self._paragraph)
elif name in ("ol", "ul", "pre") and self._paragraph:
self._append_paragraph(self._paragraph)
self._dom_path.pop()
elif name in _SEMANTIC_TAGS:
self._dom_path.pop()
def endDocument(self):
if self._paragraph:
self._append_paragraph(self._paragraph)
def _append_paragraph(self, paragraph):
paragraph = self._process_paragraph(paragraph)
self._content.append(paragraph)
self._paragraph = []
def _process_paragraph(self, paragraph):
current_paragraph = []
for annotation, items in groupby(paragraph, key=lambda i: i[1]):
if annotation and "li" in annotation:
for text, _ in items:
text = shrink_text(text)
current_paragraph.append((text, annotation))
else:
text = "".join(i[0] for i in items)
text = shrink_text(text)
current_paragraph.append((text, annotation))
return tuple(current_paragraph)
def characters(self, content):
if is_blank(content):
return
if self._dom_path:
pair = (content, tuple(sorted(frozenset(self._dom_path))))
else:
pair = (content, None)
self._paragraph.append(pair)
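# Usage sketch: AnnotatedTextHandler.parse() returns one tuple per paragraph;
# for <p>plain <em>emphasized</em></p> the paragraph comes out roughly as
# (("plain", None), ("emphasized", ("em",))) -- plain text carries a None
# annotation, annotated text carries the sorted semantic tag names.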
breadability-0.1.20/breadability/_compat.py 0000664 0001750 0001750 00000005304 12271036170 022275 0 ustar rharding rharding 0000000 0000000 # -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
from sys import version_info
PY3 = version_info[0] == 3
if PY3:
bytes = bytes
unicode = str
else:
bytes = str
unicode = unicode
string_types = (bytes, unicode,)
try:
# Assert to hush pyflakes about the unused import. This is a _compat
# module and we expect this to aid in other code importing urllib.
import urllib2 as urllib
assert urllib
except ImportError:
import urllib.request as urllib
assert urllib
def unicode_compatible(cls):
"""
Decorator for unicode compatible classes. Method ``__unicode__``
    has to be implemented for the decorator to work as expected.
"""
if PY3:
cls.__str__ = cls.__unicode__
cls.__bytes__ = lambda self: self.__str__().encode("utf8")
else:
cls.__str__ = lambda self: self.__unicode__().encode("utf8")
return cls
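# Usage sketch: implement __unicode__ once and the decorator wires up
# __str__ (and __bytes__ on Python 3) for the running interpreter.
#
#   >>> @unicode_compatible
#   ... class Greeting(object):
#   ...     def __unicode__(self):
#   ...         return u"hello"
#   >>> str(Greeting())
#   'hello'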
def to_string(object):
return to_unicode(object) if PY3 else to_bytes(object)
def to_bytes(object):
try:
if isinstance(object, bytes):
return object
elif isinstance(object, unicode):
return object.encode("utf8")
else:
# try encode instance to bytes
return instance_to_bytes(object)
except UnicodeError:
# recover from codec error and use 'repr' function
return to_bytes(repr(object))
def to_unicode(object):
try:
if isinstance(object, unicode):
return object
elif isinstance(object, bytes):
return object.decode("utf8")
else:
# try decode instance to unicode
return instance_to_unicode(object)
except UnicodeError:
# recover from codec error and use 'repr' function
return to_unicode(repr(object))
def instance_to_bytes(instance):
if PY3:
if hasattr(instance, "__bytes__"):
return bytes(instance)
elif hasattr(instance, "__str__"):
return unicode(instance).encode("utf8")
else:
if hasattr(instance, "__str__"):
return bytes(instance)
elif hasattr(instance, "__unicode__"):
return unicode(instance).encode("utf8")
return to_bytes(repr(instance))
def instance_to_unicode(instance):
if PY3:
if hasattr(instance, "__str__"):
return unicode(instance)
elif hasattr(instance, "__bytes__"):
return bytes(instance).decode("utf8")
else:
if hasattr(instance, "__unicode__"):
return unicode(instance)
elif hasattr(instance, "__str__"):
return bytes(instance).decode("utf8")
return to_unicode(repr(instance))
breadability-0.1.20/breadability/scripts/ 0000775 0001750 0001750 00000000000 12322641424 021767 5 ustar rharding rharding 0000000 0000000 breadability-0.1.20/breadability/scripts/__init__.py 0000664 0001750 0001750 00000000000 12271036170 024065 0 ustar rharding rharding 0000000 0000000 breadability-0.1.20/breadability/scripts/client.py 0000664 0001750 0001750 00000004414 12271036170 023621 0 ustar rharding rharding 0000000 0000000 # -*- coding: utf8 -*-
"""
A fast python port of arc90's readability tool
Usage:
breadability [options]
- extra tags
"""
return clean_document(doc)
def find_candidates(document):
"""
    Finds candidate nodes for the readable version of the article.
    Here we're going to remove unlikely nodes, find scores on the rest,
    clean up, and return the final best match.
"""
nodes_to_score = set()
should_remove = set()
for node in document.iter():
if is_unlikely_node(node):
logger.debug(
"We should drop unlikely: %s %r", node.tag, node.attrib)
should_remove.add(node)
elif is_bad_link(node):
logger.debug(
"We should drop bad link: %s %r", node.tag, node.attrib)
should_remove.add(node)
elif node.tag in SCORABLE_TAGS:
nodes_to_score.add(node)
return score_candidates(nodes_to_score), should_remove
def is_bad_link(node):
"""
    Helper to determine if the node is a link that is useless.
    We've hit articles with many links that should be cleaned out
because they're just there to pollute the space. See tests for examples.
"""
if node.tag != "a":
return False
name = node.get("name")
href = node.get("href")
if name and not href:
return True
if href:
href_parts = href.split("#")
if len(href_parts) == 2 and len(href_parts[1]) > 25:
return True
return False
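# Example: <a name="dest"> (a named anchor without an href) is flagged, as
# is any link whose fragment identifier is longer than 25 characters; a
# plain <a href="http://example.com/">link</a> passes.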
class Article(object):
"""Parsed readable object"""
def __init__(self, html, url=None, return_fragment=True):
"""
Create the Article we're going to use.
:param html: The string of HTML we're going to parse.
:param url: The url so we can adjust the links to still work.
        :param return_fragment: Should we return a <div> document fragment
            or a full <html> document.
        """
def leaf_div_elements_into_paragraphs(document):
    """
    Turn some block elements that don't have children block level elements
    into <p> elements.
    Since we can't change the tree as we iterate over it, we must do this
    before we process our document.
    """
for element in document.iter(tag="div"):
child_tags = tuple(n.tag for n in element.getchildren())
if "div" not in child_tags and "p" not in child_tags:
logger.debug(
"Changing leaf block element <%s> into ", element.tag)
element.tag = "p"
return document
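# Example: <div>just text</div> is retagged to <p>just text</p>, while a
# <div> that wraps another <div> or a <p> keeps its tag.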
breadability-0.1.20/AUTHORS.txt 0000664 0001750 0001750 00000000103 12271036170 017524 0 ustar rharding rharding 0000000 0000000 Rick Harding (original author)
nhnifong
Craig Maloney
Mišo Belica
breadability-0.1.20/README.rst 0000664 0001750 0001750 00000010763 12271036170 017342 0 ustar rharding rharding 0000000 0000000 breadability - another readability Python (v2.6-v3.3) port
===========================================================
.. image:: https://api.travis-ci.org/bookieio/breadability.png?branch=master
:target: https://travis-ci.org/bookieio/breadability.py
I've tried to work with the various forks of some ancient codebase that ported
`readability`_ to Python. The lack of tests, unused regexes, and commented-out
sections of code in other Python ports just drove me nuts.
I put forth an effort to bring in several of the better forks into one
code base, but they've diverged so much that I just can't work with it.
So what's any sane person to do? Re-port it with my own repo, add some tests,
infrastructure, and try to make this port better. OSS FTW (and yea, NIH FML,
but oh well I did try)
This is a pretty straight port of the JS here:
- http://code.google.com/p/arc90labs-readability/source/browse/trunk/js/readability.js#82
- http://www.minvolai.com/blog/decruft-arc90s-readability-in-python/
Alternatives
------------
- https://github.com/codelucas/newspaper
- https://github.com/grangier/python-goose
- https://github.com/aidanf/BTE
- http://www.unixuser.org/~euske/python/webstemmer/#extract
- https://github.com/al3xandru/readability.py
- https://github.com/rcarmo/soup-strainer
- https://github.com/bcampbell/decruft
- https://github.com/gfxmonk/python-readability
- https://github.com/srid/readability
- https://github.com/dcramer/decruft
- https://github.com/reorx/readability
- https://github.com/mote/python-readability
- https://github.com/predatell/python-readability-lxml
- https://github.com/Harshavardhana/boilerpipy
- https://github.com/raptium/hitomi
- https://github.com/kingwkb/readability
Installation
------------
This does depend on lxml, so you'll need some C headers installed in order
for pip to compile it.
.. code-block:: bash
$ [sudo] apt-get install libxml2-dev libxslt-dev
$ [sudo] pip install git+git://github.com/bookieio/breadability.git
Tests
-----
.. code-block:: bash
$ nosetests-2.6 tests && nosetests-3.2 tests && nosetests-2.7 tests && nosetests-3.3 tests
Usage
-----
Command line
~~~~~~~~~~~~
.. code-block:: bash
$ breadability http://wiki.python.org/moin/BeginnersGuide
Options
```````
- **b** will write out the parsed content to a temp file and open it in a
browser for viewing.
- **d** will write out debug scoring statements to help track why a node was
chosen as the document and why some nodes were removed from the final
product.
- **f** will override the default behaviour of getting an html fragment
  (<div>) and give you back a full document instead.
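Python API
~~~~~~~~~~
A minimal usage sketch (the ``Article`` class lives in
``breadability.readable``; the ``readable`` output property is assumed here):
.. code-block:: python

    from breadability.readable import Article

    document = Article(html_as_text, url=source_url)
    print(document.readable)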
breadability-0.1.20/breadability.egg-info/dependency_links.txt 0000664 0001750 0001750 00000000001 12322641424 026040 0 ustar rharding rharding 0000000 0000000
breadability-0.1.20/breadability.egg-info/top_level.txt 0000664 0001750 0001750 00000000015 12322641424 024520 0 ustar rharding rharding 0000000 0000000 breadability
breadability-0.1.20/breadability.egg-info/SOURCES.txt 0000664 0001750 0001750 00000002651 12322641424 023662 0 ustar rharding rharding 0000000 0000000 AUTHORS.txt
CHANGELOG.rst
LICENSE.rst
MANIFEST.in
README.rst
setup.cfg
setup.py
breadability/__init__.py
breadability/_compat.py
breadability/annotated_text.py
breadability/document.py
breadability/readable.py
breadability/scoring.py
breadability/utils.py
breadability.egg-info/PKG-INFO
breadability.egg-info/SOURCES.txt
breadability.egg-info/dependency_links.txt
breadability.egg-info/entry_points.txt
breadability.egg-info/not-zip-safe
breadability.egg-info/requires.txt
breadability.egg-info/top_level.txt
breadability/scripts/__init__.py
breadability/scripts/client.py
breadability/scripts/test_helper.py
tests/__init__.py
tests/compat.py
tests/test_annotated_text.py
tests/test_orig_document.py
tests/test_readable.py
tests/test_scoring.py
tests/utils.py
tests/test_articles/__init__.py
tests/test_articles/test_antipope_org/__init__.py
tests/test_articles/test_antipope_org/test_article.py
tests/test_articles/test_businessinsider-com/__init__.py
tests/test_articles/test_businessinsider-com/test_article.py
tests/test_articles/test_businessinsider_com/__init__.py
tests/test_articles/test_businessinsider_com/test_article.py
tests/test_articles/test_cz_zdrojak_tests/__init__.py
tests/test_articles/test_cz_zdrojak_tests/test_article.py
tests/test_articles/test_scripting_com/__init__.py
tests/test_articles/test_scripting_com/test_article.py
tests/test_articles/test_sweetshark/__init__.py
tests/test_articles/test_sweetshark/test_article.py breadability-0.1.20/breadability.egg-info/not-zip-safe 0000664 0001750 0001750 00000000001 12320272414 024215 0 ustar rharding rharding 0000000 0000000
breadability-0.1.20/breadability.egg-info/entry_points.txt 0000664 0001750 0001750 00000000357 12322641424 025275 0 ustar rharding rharding 0000000 0000000 [console_scripts]
breadability = breadability.scripts.client:main
breadability-2.7 = breadability.scripts.client:main
breadability_test = breadability.scripts.test_helper:main
breadability_test-2.7 = breadability.scripts.test_helper:main
breadability-0.1.20/breadability.egg-info/requires.txt 0000664 0001750 0001750 00000000044 12322641424 024370 0 ustar rharding rharding 0000000 0000000 docopt>=0.6.1,<0.7
chardet
lxml>=2.0 breadability-0.1.20/tests/ 0000775 0001750 0001750 00000000000 12322641424 017007 5 ustar rharding rharding 0000000 0000000 breadability-0.1.20/tests/compat.py 0000664 0001750 0001750 00000000320 12271036170 020636 0 ustar rharding rharding 0000000 0000000 # -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
try:
import unittest2 as unittest
except ImportError:
import unittest
breadability-0.1.20/tests/test_scoring.py 0000664 0001750 0001750 00000024541 12271036170 022071 0 ustar rharding rharding 0000000 0000000 # -*- coding: utf8 -*-
from __future__ import absolute_import
from __future__ import division, print_function, unicode_literals
import re
from operator import attrgetter
from lxml.html import document_fromstring
from lxml.html import fragment_fromstring
from breadability.readable import Article
from breadability.scoring import (
check_node_attributes,
generate_hash_id,
get_class_weight,
score_candidates,
ScoredNode,
)
from breadability.readable import (
get_link_density,
is_unlikely_node,
)
from .compat import unittest
from .utils import load_snippet
class TestHashId(unittest.TestCase):
def test_generate_hash(self):
dom = fragment_fromstring(" This is a great amount of info And more content Home
How are you? Fine\n I guess How are you? Fine\n I guess tags"""
doc = OriginalDocument(load_snippet('document_min.html'))
self.assertIsNone(doc.dom.find('.//br'))
def test_empty_title(self):
"""We convert all tags"""
document = OriginalDocument(
" tags"""
document = OriginalDocument(
" tags"""
document = OriginalDocument(" simple simplelink ."""
dom = document_fromstring(
" child ')
self.assertEqual(get_class_weight(node), 25)
def test_positive_ids(self):
"""Some ids get us bonus points."""
node = fragment_fromstring(' ')
self.assertEqual(get_class_weight(node), 25)
def test_negative_class(self):
"""Some classes get us negative points."""
node = fragment_fromstring(' ')
self.assertEqual(get_class_weight(node), -25)
def test_negative_ids(self):
"""Some ids get us negative points."""
node = fragment_fromstring(' ')
self.assertEqual(get_class_weight(node), -25)
class TestScoringNodes(unittest.TestCase):
"""We take out list of potential nodes and score them up."""
def test_we_get_candidates(self):
"""Processing candidates should get us a list of nodes to try out."""
doc = document_fromstring(load_article("ars.001.html"))
test_nodes = tuple(doc.iter("p", "td", "pre"))
candidates = score_candidates(test_nodes)
# this might change as we tweak our algorithm, but if it does,
# it signifies we need to look at what we changed.
self.assertEqual(len(candidates.keys()), 37)
# one of these should have a decent score
scores = sorted(c.content_score for c in candidates.values())
self.assertTrue(scores[-1] > 100)
def test_bonus_score_per_100_chars_in_p(self):
"""Nodes get 1 point per 100 characters up to max. 3 points."""
def build_candidates(length):
html = " %s This is text with no annotations This is text\r\twith This is\n\tsimple\ttext. Paragraph \t \n 1 first 2\tsecond 3\rthird text emphasis last text emphasis last
tag and multiple
tags into paragraph.
"""
logger.debug("Converting multiple
&
tags into
1:
return to_unicode("Heading
'
node = fragment_fromstring(test_div)
snode = ScoredNode(node)
self.assertEqual(snode.content_score, -5)
    def test_list_items(self):
        """List items get a negative score."""
        test_div = '<li>list item</li>'
        node = fragment_fromstring(test_div)
        snode = ScoredNode(node)
        self.assertEqual(snode.content_score, -3)
    def test_bad_links(self):
        """Some links should just not belong."""
        bad_links = [
            '<a name="amazonAndHulu"></a>',
            '<a href="#link-to-a-very-long-fragment-identifier-over-25-chars"></a>',
            '<a name="noteAnchor"></a>',
        ]
for l in bad_links:
link = fragment_fromstring(l)
self.assertTrue(is_bad_link(link))
class TestCandidateNodes(unittest.TestCase):
"""Candidate nodes are scoring containers we use."""
def test_candidate_scores(self):
"""We should be getting back objects with some scores."""
        fives = ['<div/>']
        threes = ['<pre/>', '<td/>', '<blockquote/>']
        neg_threes = ['<address/>', '<ol/>']
        neg_fives = ['<h1/>', '<h2/>', '<h3/>', '<h4/>']
for n in fives:
doc = fragment_fromstring(n)
self.assertEqual(ScoredNode(doc).content_score, 5)
for n in threes:
doc = fragment_fromstring(n)
self.assertEqual(ScoredNode(doc).content_score, 3)
for n in neg_threes:
doc = fragment_fromstring(n)
self.assertEqual(ScoredNode(doc).content_score, -3)
for n in neg_fives:
doc = fragment_fromstring(n)
self.assertEqual(ScoredNode(doc).content_score, -5)
def test_article_enables_candidate_access(self):
"""Candidates are accessible after document processing."""
doc = Article(load_article('ars.001.html'))
self.assertTrue(hasattr(doc, 'candidates'))
class TestClassWeights(unittest.TestCase):
"""Certain ids and classes get us bonus points."""
def test_positive_class(self):
"""Some classes get us bonus points."""
        node = fragment_fromstring('<div class="article">content</div>')
        self.assertEqual(get_class_weight(node), 25)