html-text-0.5.2/.coveragerc

[run]
branch = True
html-text-0.5.2/.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
.pytest_cache
# Translations
*.mo
*.pot
# Django stuff:
*.log
# Sphinx documentation
docs/_build/
# PyBuilder
target/
html-text-0.5.2/.travis.yml

language: python
sudo: false
branches:
only:
- master
- /^\d\.\d+$/
matrix:
include:
- python: 2.7
env: TOXENV=py27
- python: 2.7
env: TOXENV=py27-parsel
- python: 3.5
env: TOXENV=py35
- python: 3.6
env: TOXENV=py36
- python: 3.6
env: TOXENV=py36-parsel
- python: 3.7
env: TOXENV=py37
- python: 3.8
env: TOXENV=py38
install:
- pip install -U pip tox codecov
script: tox
after_success:
- codecov
cache:
directories:
- $HOME/.cache/pip
html-text-0.5.2/CHANGES.rst

=======
History
=======
0.5.2 (2020-07-22)
------------------
* Handle lxml Cleaner exceptions (a workaround for
  https://bugs.launchpad.net/lxml/+bug/1838497);
* Python 3.8 support;
* testing improvements.
0.5.1 (2019-05-27)
------------------
Fixed whitespace handling when ``guess_punct_space`` is False: html-text was
producing unnecessary spaces after newlines.
0.5.0 (2018-11-19)
------------------
Parsel dependency is removed in this release,
though parsel is still supported.
* the ``parsel`` package is no longer required to install and use html-text;
* the ``html_text.etree_to_text`` function allows extracting text from
  lxml Elements;
* ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance with
options tuned for text extraction speed and quality;
* test and documentation improvements;
* Python 3.7 support.
0.4.1 (2018-09-25)
------------------
Fixed a regression in the 0.4.0 release: text was empty when
``html_text.extract_text`` was called with a node that has text
but no children.
0.4.0 (2018-09-25)
------------------
This is a backwards-incompatible release: by default, html_text functions
now add newlines after elements, if appropriate, to make the extracted text
look more like how it is rendered in a browser.
To turn this off, pass ``guess_layout=False`` to html_text functions.
* ``guess_layout`` option to make extracted text look more like how
  it is rendered in a browser.
* Add tests of layout extraction for real webpages.
0.3.0 (2017-10-12)
------------------
* Expose functions that operate on selectors;
  use ``.//text()`` to extract text from a selector.
0.2.1 (2017-05-29)
------------------
* Packaging fix (include CHANGES.rst)
0.2.0 (2017-05-29)
------------------
* Fix unwanted joins of words with inline tags: spaces are added for inline
tags too, but a heuristic is used to preserve punctuation without extra spaces.
* Accept parsed html trees.
0.1.1 (2017-01-16)
------------------
* Travis-CI and codecov.io integrations added
0.1.0 (2016-09-27)
------------------
* First release on PyPI.
html-text-0.5.2/LICENSE
MIT License
Copyright (c) 2016, Konstantin Lopukhin
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
html-text-0.5.2/MANIFEST.in
include CONTRIBUTING.rst
include CHANGES.rst
include LICENSE
include README.rst
recursive-include tests *
recursive-exclude * __pycache__
recursive-exclude * *.py[co]
recursive-include docs *.rst conf.py Makefile make.bat *.jpg *.png *.gif
html-text-0.5.2/README.rst

============
HTML to Text
============
.. image:: https://img.shields.io/pypi/v/html-text.svg
:target: https://pypi.python.org/pypi/html-text
:alt: PyPI Version
.. image:: https://img.shields.io/travis/TeamHG-Memex/html-text.svg
:target: https://travis-ci.org/TeamHG-Memex/html-text
:alt: Build Status
.. image:: https://codecov.io/github/TeamHG-Memex/html-text/coverage.svg?branch=master
:target: http://codecov.io/github/TeamHG-Memex/html-text?branch=master
:alt: Code Coverage
Extract text from HTML
* Free software: MIT license
How is html_text different from ``.xpath('//text()')`` from LXML
or ``.get_text()`` from Beautiful Soup?
* Text extracted with ``html_text`` does not contain inline styles,
javascript, comments and other text that is not normally visible to users;
* ``html_text`` normalizes whitespace, but in a way smarter than
  ``.xpath('normalize-space()')``, adding spaces around inline elements
  (which are often used as block elements in html markup), and trying to
  avoid adding extra spaces for punctuation;
* ``html-text`` can add newlines (e.g. after headers or paragraphs), so
that the output text looks more like how it is rendered in browsers.
Install
-------
Install with pip::
pip install html-text
The package depends on lxml, so you might need to install additional
packages: http://lxml.de/installation.html
Usage
-----
Extract text from HTML::
>>> import html_text
>>> html_text.extract_text('<h1>Hello</h1> world!', guess_layout=False)
'Hello world!'
The passed html is first cleaned of invisible, non-text content such
as styles; then text is extracted.
You can also pass an already parsed ``lxml.html.HtmlElement``:
>>> import html_text
>>> tree = html_text.parse_html('<h1>Hello</h1> world!')
>>> html_text.extract_text(tree)
'Hello\n\nworld!'
If you want, you can handle cleaning manually; use lower-level
``html_text.etree_to_text`` in this case:
>>> import html_text
>>> tree = html_text.parse_html('<div>Hello<style>.div {}</style>!</div>')
>>> cleaned_tree = html_text.cleaner.clean_html(tree)
>>> html_text.etree_to_text(cleaned_tree)
'Hello!'
parsel.Selector objects are also supported; you can define
a parsel.Selector to extract text only from specific elements:
>>> import html_text
>>> sel = html_text.cleaned_selector('<h1>Hello</h1> world!')
>>> subsel = sel.xpath('//h1')
>>> html_text.selector_to_text(subsel)
'Hello'
NB: parsel.Selector objects are not cleaned automatically; you need to call
``html_text.cleaned_selector`` first.
Main functions and objects:
* ``html_text.extract_text`` accepts html and returns extracted text.
* ``html_text.etree_to_text`` accepts parsed lxml Element and returns
extracted text; it is a lower-level function, cleaning is not handled
here.
* ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance which
can be used with ``html_text.etree_to_text``; its options are tuned for
speed and text extraction quality.
* ``html_text.cleaned_selector`` accepts html as text or as
``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``.
* ``html_text.selector_to_text`` accepts ``parsel.Selector`` and returns
extracted text.
If ``guess_layout`` is True (default), a newline is added before and after
``newline_tags``, and two newlines are added before and after
``double_newline_tags``. This heuristic makes the extracted text
more similar to how it is rendered in the browser. Default newline and double
newline tags can be found in ``html_text.NEWLINE_TAGS``
and ``html_text.DOUBLE_NEWLINE_TAGS``.
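For example, ``div`` is in ``html_text.NEWLINE_TAGS``, so with the default
``guess_layout=True`` a newline separates a ``<div>`` from the following
text (an illustrative snippet)::

>>> import html_text
>>> html_text.extract_text('<div>Hello</div> world!')
'Hello\nworld!'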
It is possible to customize how newlines are added, using ``newline_tags`` and
``double_newline_tags`` arguments (which are ``html_text.NEWLINE_TAGS`` and
``html_text.DOUBLE_NEWLINE_TAGS`` by default). For example, don't add
a newline after ``<div>`` elements::

>>> newline_tags = html_text.NEWLINE_TAGS - {'div'}
>>> html_text.extract_text('<div>Hello</div> world!',
...                        newline_tags=newline_tags)
'Hello world!'
Apart from just getting text from the page (e.g. for display or search),
one intended usage of this library is for machine learning (feature extraction).
If you want to use the text of the html page as a feature (e.g. for classification),
this library gives you plain text that you can later feed into a standard text
classification pipeline.
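For instance, a minimal sketch of such a pipeline (assuming scikit-learn is
available; the variable names here are for illustration only)::

>>> import html_text
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> pages = ['<h1>Hello</h1> world!', '<div>Another page</div>']
>>> texts = [html_text.extract_text(page) for page in pages]
>>> features = TfidfVectorizer().fit_transform(texts)  # ready for a classifier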
If you feel that you need html structure as well, check out the
`webstruct <https://github.com/scrapinghub/webstruct>`_ library.
----
.. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg
:target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=html-text
:alt: define hyperiongray
html-text-0.5.2/codecov.yml

comment:
layout: "header, diff, tree"
coverage:
status:
project: false
html-text-0.5.2/html_text/__init__.py

# -*- coding: utf-8 -*-
__version__ = '0.5.2'
from .html_text import (etree_to_text, extract_text, selector_to_text,
parse_html, cleaned_selector, cleaner,
NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS)
html-text-0.5.2/html_text/html_text.py

# -*- coding: utf-8 -*-
import re
import lxml
import lxml.etree
from lxml.html.clean import Cleaner
NEWLINE_TAGS = frozenset([
'article', 'aside', 'br', 'dd', 'details', 'div', 'dt', 'fieldset',
'figcaption', 'footer', 'form', 'header', 'hr', 'legend', 'li', 'main',
'nav', 'table', 'tr'
])
DOUBLE_NEWLINE_TAGS = frozenset([
'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol',
'p', 'pre', 'title', 'ul'
])
cleaner = Cleaner(
scripts=True,
javascript=False, # onclick attributes are fine
comments=True,
style=True,
links=True,
meta=True,
    page_structure=False,  # <title> may be nice to have
processing_instructions=True,
embedded=True,
frames=True,
forms=False, # keep forms
annoying_tags=False,
remove_unknown_tags=False,
safe_attrs_only=False,
)
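# A small illustrative sketch (not part of the module's execution path) of how
# this ``cleaner`` is meant to be used, mirroring ``_cleaned_html_tree`` below:
#
#   tree = parse_html(u'<div>Hello<style>.div {}</style>!</div>')
#   cleaned = cleaner.clean_html(tree)  # styles/scripts/comments are removed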
def _cleaned_html_tree(html):
if isinstance(html, lxml.html.HtmlElement):
tree = html
else:
tree = parse_html(html)
    # workaround for https://bugs.launchpad.net/lxml/+bug/1838497
try:
cleaned = cleaner.clean_html(tree)
except AssertionError:
cleaned = tree
return cleaned
def parse_html(html):
""" Create an lxml.html.HtmlElement from a string with html.
XXX: mostly copy-pasted from parsel.selector.create_root_node
"""
    body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>'
parser = lxml.html.HTMLParser(recover=True, encoding='utf8')
root = lxml.etree.fromstring(body, parser=parser)
    if root is None:
        root = lxml.etree.fromstring(b'<html/>', parser=parser)
return root
_whitespace = re.compile(r'\s+')
_has_trailing_whitespace = re.compile(r'\s$').search
_has_punct_after = re.compile(r'^[,:;.!?")]').search
_has_open_bracket_before = re.compile(r'\($').search
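# Illustrative examples of the punctuation heuristic these patterns drive
# (see should_add_space below):
#   prev='field', text=', and more' -> text starts with punctuation,
#   so no extra space is added: 'field, and more'
#   prev='a (', text='boo' -> the previous chunk ends with an open bracket,
#   so no extra space is added: 'a (boo'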
def _normalize_whitespace(text):
return _whitespace.sub(' ', text.strip())
def etree_to_text(tree,
guess_punct_space=True,
guess_layout=True,
newline_tags=NEWLINE_TAGS,
double_newline_tags=DOUBLE_NEWLINE_TAGS):
"""
Convert a html tree to text. Tree should be cleaned with
``html_text.html_text.cleaner.clean_html`` before passing to this
function.
See html_text.extract_text docstring for description of the
approach and options.
"""
chunks = []
_NEWLINE = object()
_DOUBLE_NEWLINE = object()
class Context:
""" workaround for missing `nonlocal` in Python 2 """
# _NEWLINE, _DOUBLE_NEWLINE or content of the previous chunk (str)
prev = _DOUBLE_NEWLINE
def should_add_space(text, prev):
""" Return True if extra whitespace should be added before text """
if prev in {_NEWLINE, _DOUBLE_NEWLINE}:
return False
if not guess_punct_space:
return True
if not _has_trailing_whitespace(prev):
if _has_punct_after(text) or _has_open_bracket_before(prev):
return False
return True
def get_space_between(text, prev):
if not text:
return ' '
return ' ' if should_add_space(text, prev) else ''
def add_newlines(tag, context):
if not guess_layout:
return
prev = context.prev
if prev is _DOUBLE_NEWLINE: # don't output more than 1 blank line
return
if tag in double_newline_tags:
context.prev = _DOUBLE_NEWLINE
chunks.append('\n' if prev is _NEWLINE else '\n\n')
elif tag in newline_tags:
context.prev = _NEWLINE
if prev is not _NEWLINE:
chunks.append('\n')
def add_text(text_content, context):
text = _normalize_whitespace(text_content) if text_content else ''
if not text:
return
space = get_space_between(text, context.prev)
chunks.extend([space, text])
context.prev = text_content
def traverse_text_fragments(tree, context, handle_tail=True):
""" Extract text from the ``tree``: fill ``chunks`` variable """
add_newlines(tree.tag, context)
add_text(tree.text, context)
for child in tree:
traverse_text_fragments(child, context)
add_newlines(tree.tag, context)
if handle_tail:
add_text(tree.tail, context)
traverse_text_fragments(tree, context=Context(), handle_tail=False)
return ''.join(chunks).strip()
def selector_to_text(sel, guess_punct_space=True, guess_layout=True):
""" Convert a cleaned parsel.Selector to text.
See html_text.extract_text docstring for description of the approach
and options.
"""
import parsel
if isinstance(sel, parsel.SelectorList):
# if selecting a specific xpath
text = []
for s in sel:
extracted = etree_to_text(
s.root,
guess_punct_space=guess_punct_space,
guess_layout=guess_layout)
if extracted:
text.append(extracted)
return ' '.join(text)
else:
return etree_to_text(
sel.root,
guess_punct_space=guess_punct_space,
guess_layout=guess_layout)
def cleaned_selector(html):
""" Clean parsel.selector.
"""
import parsel
try:
tree = _cleaned_html_tree(html)
sel = parsel.Selector(root=tree, type='html')
except (lxml.etree.XMLSyntaxError,
lxml.etree.ParseError,
lxml.etree.ParserError,
UnicodeEncodeError):
# likely plain text
sel = parsel.Selector(html)
return sel
def extract_text(html,
guess_punct_space=True,
guess_layout=True,
newline_tags=NEWLINE_TAGS,
double_newline_tags=DOUBLE_NEWLINE_TAGS):
"""
Convert html to text, cleaning invisible content such as styles.
    Almost the same as normalize-space XPath, but this also
    adds spaces between inline elements (like ``<span>``), which are
    often used as block elements in html markup, and adds appropriate
    newlines to make the output better formatted.
html should be a unicode string or an already parsed lxml.html element.
    ``html_text.etree_to_text`` is a lower-level function which only accepts
    an already parsed lxml.html Element and does not do html cleaning itself.
When guess_punct_space is True (default), no extra whitespace is added
for punctuation. This has a slight (around 10%) performance overhead
and is just a heuristic.
When guess_layout is True (default), a newline is added
before and after ``newline_tags`` and two newlines are added before
and after ``double_newline_tags``. This heuristic makes the extracted
text more similar to how it is rendered in the browser.
Default newline and double newline tags can be found in
`html_text.NEWLINE_TAGS` and `html_text.DOUBLE_NEWLINE_TAGS`.
"""
if html is None:
return ''
cleaned = _cleaned_html_tree(html)
return etree_to_text(
cleaned,
guess_punct_space=guess_punct_space,
guess_layout=guess_layout,
newline_tags=newline_tags,
double_newline_tags=double_newline_tags,
)
html-text-0.5.2/setup.cfg

[bumpversion]
current_version = 0.5.2
commit = True
tag = True
tag_name = {new_version}
[bumpversion:file:setup.py]
search = version='{current_version}'
replace = version='{new_version}'
[bumpversion:file:html_text/__init__.py]
search = __version__ = '{current_version}'
replace = __version__ = '{new_version}'
[bdist_wheel]
universal = 1
html-text-0.5.2/setup.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from setuptools import setup
with open('README.rst') as readme_file:
readme = readme_file.read()
with open('CHANGES.rst') as history_file:
history = history_file.read()
setup(
name='html_text',
version='0.5.2',
description="Extract text from HTML",
long_description=readme + '\n\n' + history,
author="Konstantin Lopukhin",
author_email='kostia.lopuhin@gmail.com',
url='https://github.com/TeamHG-Memex/html-text',
packages=['html_text'],
include_package_data=True,
install_requires=['lxml'],
license="MIT license",
zip_safe=False,
classifiers=[
'Development Status :: 4 - Beta',
'Intended Audience :: Developers',
'License :: OSI Approved :: MIT License',
'Natural Language :: English',
"Programming Language :: Python :: 2",
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
],
test_suite='tests',
tests_require=['pytest'],
)
html-text-0.5.2/tests/__init__.py

# -*- coding: utf-8 -*-
html-text-0.5.2/tests/test_html_text.py

# -*- coding: utf-8 -*-
import glob
import os
import six
import pytest
from html_text import (extract_text, parse_html, cleaned_selector,
etree_to_text, cleaner, selector_to_text, NEWLINE_TAGS,
DOUBLE_NEWLINE_TAGS)
ROOT = os.path.dirname(os.path.abspath(__file__))
@pytest.fixture(params=[
{'guess_punct_space': True, 'guess_layout': False},
{'guess_punct_space': False, 'guess_layout': False},
{'guess_punct_space': True, 'guess_layout': True},
{'guess_punct_space': False, 'guess_layout': True}
])
def all_options(request):
return request.param
def test_extract_no_text_html(all_options):
    html = (u'<!DOCTYPE html><html><body><p><video width="320" height="240" '
            'controls><source src="movie.mp4" type="video/mp4"><source '
            'src="movie.ogg" type="video/ogg"></video></p></body></html>')
assert extract_text(html, **all_options) == u''
def test_extract_text(all_options):
    html = (u'<html><style>.div {}</style>'
            '<body><p>Hello,   world!</body></html>')
tree = parse_html(html)
assert extract_text(tree, **all_options) == u'Hello, world!'
def test_extract_text_from_node(all_options):
    html = (u'<html><style>.div {}</style>'
            '<body><p>Hello,   world!</p></body></html>')
tree = parse_html(html)
node = tree.xpath('//p')[0]
assert extract_text(node, **all_options) == u'Hello, world!'
def test_inline_tags_whitespace(all_options):
    html = u'<span>field</span><span>value  of</span>'
assert extract_text(html, **all_options) == u'field value of'
def test_extract_text_from_fail_html():
html = ""
tree = parse_html(html)
node = tree.xpath('/html/frameset')[0]
assert extract_text(node) == u''
def test_punct_whitespace():
    html = u'<div><span>field</span>, and more</div>'
assert extract_text(html, guess_punct_space=False) == u'field , and more'
assert extract_text(html, guess_punct_space=True) == u'field, and more'
def test_punct_whitespace_preserved():
    html = (u'<div><span>по</span><span>ле</span>, and  ,  '
            u'<span>more </span>!<span>now</span> a (<b>boo</b>)</div>')
text = extract_text(html, guess_punct_space=True, guess_layout=False)
assert text == u'по ле, and , more ! now a (boo)'
@pytest.mark.xfail(reason="code punctuation should be handled differently")
def test_bad_punct_whitespace():
    html = (u'<div><span>trees</span> '
            '<span>=</span><span>webstruct</span>'
            '<span>.</span><span>load_trees</span>'
            '<span>(</span><span>"train/*.html"</span>'
            '<span>)</span></div>')
text = extract_text(html, guess_punct_space=False, guess_layout=False)
assert text == u'trees = webstruct . load_trees ( "train/*.html" )'
text = extract_text(html, guess_punct_space=True, guess_layout=False)
assert text == u'trees = webstruct.load_trees("train/*.html")'
def test_selectors(all_options):
pytest.importorskip("parsel")
    html = (u'<span id="extract-me">text<a>more</a></span>'
            'and more text <a>and some more</a> <a></a>')
# Selector
sel = cleaned_selector(html)
assert selector_to_text(sel, **all_options) == 'text more and more text and some more'
# SelectorList
subsel = sel.xpath('//span[@id="extract-me"]')
assert selector_to_text(subsel, **all_options) == 'text more'
subsel = sel.xpath('//a')
assert selector_to_text(subsel, **all_options) == 'more and some more'
subsel = sel.xpath('//a[@id="extract-me"]')
assert selector_to_text(subsel, **all_options) == ''
subsel = sel.xpath('//foo')
assert selector_to_text(subsel, **all_options) == ''
def test_nbsp():
if six.PY2:
raise pytest.xfail(" produces '\xa0' in Python 2, "
"but ' ' in Python 3")
html = "
Foo Bar
"
assert extract_text(html) == "Foo Bar"
def test_guess_layout():
    html = (u'<title>  title  </title><div>text_1.<p>text_2 text_3</p>'
            '<p id="demo"></p><ul><li>text_4</li><li>text_5</li></ul>'
            '<p>text_6<em>text_7</em>text_8</p>text_9</div>'
            '<script>document.getElementById("demo").innerHTML = '
            '"JavaScript";</script><p>...text_10</p>')
text = 'title text_1. text_2 text_3 text_4 text_5 text_6 text_7 ' \
'text_8 text_9 ...text_10'
assert extract_text(html, guess_punct_space=False, guess_layout=False) == text
text = ('title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5'
'\n\ntext_6 text_7 text_8\n\ntext_9\n\n...text_10')
assert extract_text(html, guess_punct_space=False, guess_layout=True) == text
text = 'title text_1. text_2 text_3 text_4 text_5 text_6 text_7 ' \
'text_8 text_9...text_10'
assert extract_text(html, guess_punct_space=True, guess_layout=False) == text
text = 'title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5\n\n' \
'text_6 text_7 text_8\n\ntext_9\n\n...text_10'
assert extract_text(html, guess_punct_space=True, guess_layout=True) == text
def test_basic_newline():
    html = u'<div>a</div><div>b</div>'
assert extract_text(html, guess_punct_space=False, guess_layout=False) == 'a b'
assert extract_text(html, guess_punct_space=False, guess_layout=True) == 'a\nb'
assert extract_text(html, guess_punct_space=True, guess_layout=False) == 'a b'
assert extract_text(html, guess_punct_space=True, guess_layout=True) == 'a\nb'
def test_adjust_newline():
    html = u'<div>text 1</div><p>text 2</p>'
assert extract_text(html, guess_layout=True) == 'text 1\n\ntext 2'
def test_personalize_newlines_sets():
    html = (u'<span id="extract-me">text<a>more</a></span>'
            'and more text <a>and some more</a> <a></a>')
text = extract_text(html, guess_layout=True,
newline_tags=NEWLINE_TAGS | {'a'})
assert text == 'text\nmore\nand more text\nand some more'
text = extract_text(html, guess_layout=True,
double_newline_tags=DOUBLE_NEWLINE_TAGS | {'a'})
assert text == 'text\n\nmore\n\nand more text\n\nand some more'
def _webpage_paths():
webpages = sorted(glob.glob(os.path.join(ROOT, 'test_webpages', '*.html')))
    extracted = sorted(glob.glob(os.path.join(ROOT, 'test_webpages', '*.txt')))
return list(zip(webpages, extracted))
def _load_file(path):
with open(path, 'rb') as f:
return f.read().decode('utf8')
@pytest.mark.parametrize(['page', 'extracted'], _webpage_paths())
def test_webpages(page, extracted):
html = _load_file(page)
if not six.PY3:
# FIXME: produces '\xa0' in Python 2, but ' ' in Python 3
# this difference is ignored in this test.
# What is the correct behavior?
html = html.replace(' ', ' ')
expected = _load_file(extracted)
assert extract_text(html) == expected
tree = cleaner.clean_html(parse_html(html))
assert etree_to_text(tree) == expected
html-text-0.5.2/tests/test_webpages/A Light in the Attic | Books to Scrape - Sandbox.html

A Light in the Attic | Books to Scrape - Sandbox
A Light in the Attic | Books to Scrape - Sandbox
Warning! This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
Product Description
It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more
Product Information
UPC
a897fe39b1053632
Product Type
Books
Price (excl. tax)
£51.77
Price (incl. tax)
£51.77
Tax
£0.00
Availability
In stock (22 available)
Number of reviews
0
html-text-0.5.2/tests/test_webpages/A Light in the Attic | Books to Scrape - Sandbox.txt

A Light in the Attic | Books to Scrape - Sandbox
Books to Scrape We love being scraped!
Home
Books
Poetry
A Light in the Attic
A Light in the Attic
£51.77
In stock (22 available)
Warning! This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning.
Product Description
It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more
Product Information
UPC a897fe39b1053632
Product Type Books
Price (excl. tax) £51.77
Price (incl. tax) £51.77
Tax £0.00
Availability In stock (22 available)
Number of reviews 0

html-text-0.5.2/tests/test_webpages/IANA — IANA-managed Reserved Domains.html
IANA — IANA-managed Reserved Domains
Certain domains are set aside, and nominally registered to “IANA”, for specific
policy or technical purposes.
Example domains
As described in RFC 2606 and RFC 6761,
a number of domains such as example.com and example.org
are maintained for documentation purposes. These domains may be used as illustrative
examples in documents without prior coordination with us. They are
not available for registration or transfer.
Test IDN top-level domains
These domains were temporarily delegated by IANA for the IDN Evaluation being conducted by ICANN.
We act as both the registrant and registrar for a select number of domains
which have been reserved under policy grounds. These exclusions are
typically indicated in either technical standards (RFC documents),
or contractual limitations.
Domains which are described as registered to IANA or ICANN on policy
grounds are not available for registration or transfer, with the exception
of country-name.info domains. These domains are available for release
by the ICANN Governmental Advisory Committee Secretariat.
html-text-0.5.2/tests/test_webpages/IANA — IANA-managed Reserved Domains.txt

IANA — IANA-managed Reserved Domains
Domains
Numbers
Protocols
About Us
IANA-managed Reserved Domains
Certain domains are set aside, and nominally registered to “IANA”, for specific policy or technical purposes.
Example domains
As described in RFC 2606 and RFC 6761, a number of domains such as example.com and example.org are maintained for documentation purposes. These domains may be used as illustrative examples in documents without prior coordination with us. They are not available for registration or transfer.
Test IDN top-level domains
These domains were temporarily delegated by IANA for the IDN Evaluation being conducted by ICANN.
Domain Domain (A-label) Language Script
إختبار XN--KGBECHTV Arabic Arabic
آزمایشی XN--HGBK6AJ7F53BBA Persian Arabic
测试 XN--0ZWM56D Chinese Han (Simplified variant)
測試 XN--G6W251D Chinese Han (Traditional variant)
испытание XN--80AKHBYKNJ4F Russian Cyrillic
परीक्षा XN--11B5BS3A9AJ6G Hindi Devanagari (Nagari)
δοκιμή XN--JXALPDLP Greek, Modern (1453-) Greek
테스트 XN--9T4B11YI5A Korean Hangul (Hangŭl, Hangeul)
טעסט XN--DEBA0AD Yiddish Hebrew
テスト XN--ZCKZAH Japanese Katakana
பரிட்சை XN--HLCJ6AYA9ESC7A Tamil Tamil
Policy-reserved domains
We act as both the registrant and registrar for a select number of domains which have been reserved under policy grounds. These exclusions are typically indicated in either technical standards (RFC documents), or contractual limitations.
Domains which are described as registered to IANA or ICANN on policy grounds are not available for registration or transfer, with the exception of country-name.info domains. These domains are available for release by the ICANN Governmental Advisory Committee Secretariat.
Other Special-Use Domains
There is additionally a Special-Use Domain Names registry documenting special-use domains designated by technical standards. For further information, see Special-Use Domain Names (RFC 6761).
Domain Names
Overview
Root Zone Management
Overview
Root Database
Hint and Zone Files
Change Requests
Instructions & Guides
Root Servers
.INT Registry
Overview
Register/modify an .INT domain
Eligibility
.ARPA Registry
IDN Practices Repository
Overview
Submit a table
Root Key Signing Key (DNSSEC)
Overview
Trusts Anchors and Keys
Root KSK Ceremonies
Practice Statement
Community Representatives
Reserved Domains
Domain Names
Root Zone Registry
.INT Registry
.ARPA Registry
IDN Repository
Number Resources
Abuse Information
Protocols
Protocol Registries
Time Zone Database
About Us
Presentations
Reports
Performance
Reviews
Excellence
Contact Us
The IANA functions coordinate the Internet’s globally unique identifiers, and are provided by Public Technical Identifiers, an affiliate of ICANN.
Privacy Policy
Terms of Service

html-text-0.5.2/tests/test_webpages/Scrapinghub Enterprise Solutions.html

Scrapinghub Enterprise Solutions
Enterprise Solutions
Complete web scraping services for any size business, from startups to Fortune 100’s