pax_global_header00006660000000000000000000000064137061076070014521gustar00rootroot0000000000000052 comment=ce3aa606b93b4d07bbe8fecba283a107a4c881be html-text-0.5.2/000077500000000000000000000000001370610760700134535ustar00rootroot00000000000000html-text-0.5.2/.coveragerc000066400000000000000000000000241370610760700155700ustar00rootroot00000000000000[run] branch = True html-text-0.5.2/.gitignore000066400000000000000000000013441370610760700154450ustar00rootroot00000000000000# Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *,cover .hypothesis/ .pytest_cache # Translations *.mo *.pot # Django stuff: *.log # Sphinx documentation docs/_build/ # PyBuilder target/ html-text-0.5.2/.travis.yml000066400000000000000000000010501370610760700155600ustar00rootroot00000000000000language: python sudo: false branches: only: - master - /^\d\.\d+$/ matrix: include: - python: 2.7 env: TOXENV=py27 - python: 2.7 env: TOXENV=py27-parsel - python: 3.5 env: TOXENV=py35 - python: 3.6 env: TOXENV=py36 - python: 3.6 env: TOXENV=py36-parsel - python: 3.7 env: TOXENV=py37 - python: 3.8 env: TOXENV=py38 install: - pip install -U pip tox codecov script: tox after_success: - codecov cache: directories: - $HOME/.cache/pip html-text-0.5.2/CHANGES.rst000066400000000000000000000040441370610760700152570ustar00rootroot00000000000000======= History ======= 0.5.2 (2020-07-22) ------------------ * Handle lxml Cleaner exceptions (a workaround for https://bugs.launchpad.net/lxml/+bug/1838497 ); * Python 3.8 support; * testing improvements. 0.5.1 (2019-05-27) ------------------ Fixed whitespace handling when ``guess_punct_space`` is False: html-text was producing unnecessary spaces after newlines. 0.5.0 (2018-11-19) ------------------ Parsel dependency is removed in this release, though parsel is still supported. * ``parsel`` package is no longer required to install and use html-text; * ``html_text.etree_to_text`` function allows to extract text from lxml Elements; * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance with options tuned for text extraction speed and quality; * test and documentation improvements; * Python 3.7 support. 0.4.1 (2018-09-25) ------------------ Fixed a regression in 0.4.0 release: text was empty when ``html_text.extract_text`` is called with a node with text, but without children. 0.4.0 (2018-09-25) ------------------ This is a backwards-incompatible release: by default html_text functions now add newlines after elements, if appropriate, to make the extracted text to look more like how it is rendered in a browser. To turn it off, pass ``guess_layout=False`` option to html_text functions. * ``guess_layout`` option to to make extracted text look more like how it is rendered in browser. * Add tests of layout extraction for real webpages. 0.3.0 (2017-10-12) ------------------ * Expose functions that operate on selectors, use ``.//text()`` to extract text from selector. 
0.2.1 (2017-05-29) ------------------ * Packaging fix (include CHANGES.rst) 0.2.0 (2017-05-29) ------------------ * Fix unwanted joins of words with inline tags: spaces are added for inline tags too, but a heuristic is used to preserve punctuation without extra spaces. * Accept parsed html trees. 0.1.1 (2017-01-16) ------------------ * Travis-CI and codecov.io integrations added 0.1.0 (2016-09-27) ------------------ * First release on PyPI. html-text-0.5.2/LICENSE000066400000000000000000000020671370610760700144650ustar00rootroot00000000000000 MIT License Copyright (c) 2016, Konstantin Lopukhin Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. html-text-0.5.2/MANIFEST.in000066400000000000000000000003631370610760700152130ustar00rootroot00000000000000 include CONTRIBUTING.rst include CHANGES.rst include LICENSE include README.rst recursive-include tests * recursive-exclude * __pycache__ recursive-exclude * *.py[co] recursive-include docs *.rst conf.py Makefile make.bat *.jpg *.png *.gif html-text-0.5.2/README.rst000066400000000000000000000112741370610760700151470ustar00rootroot00000000000000============ HTML to Text ============ .. image:: https://img.shields.io/pypi/v/html-text.svg :target: https://pypi.python.org/pypi/html-text :alt: PyPI Version .. image:: https://img.shields.io/travis/TeamHG-Memex/html-text.svg :target: https://travis-ci.org/TeamHG-Memex/html-text :alt: Build Status .. image:: http://codecov.io/github/TeamHG-Memex/soft404/coverage.svg?branch=master :target: http://codecov.io/github/TeamHG-Memex/html-text?branch=master :alt: Code Coverage Extract text from HTML * Free software: MIT license How is html_text different from ``.xpath('//text()')`` from LXML or ``.get_text()`` from Beautiful Soup? * Text extracted with ``html_text`` does not contain inline styles, javascript, comments and other text that is not normally visible to users; * ``html_text`` normalizes whitespace, but in a way smarter than ``.xpath('normalize-space())``, adding spaces around inline elements (which are often used as block elements in html markup), and trying to avoid adding extra spaces for punctuation; * ``html-text`` can add newlines (e.g. after headers or paragraphs), so that the output text looks more like how it is rendered in browsers. Install ------- Install with pip:: pip install html-text The package depends on lxml, so you might need to install additional packages: http://lxml.de/installation.html Usage ----- Extract text from HTML:: >>> import html_text >>> html_text.extract_text('

<h1>Hello</h1> world!')
'Hello\n\nworld!'

>>> html_text.extract_text('<h1>Hello</h1>
world!', guess_layout=False) 'Hello world!' Passed html is first cleaned from invisible non-text content such as styles, and then text is extracted. You can also pass an already parsed ``lxml.html.HtmlElement``: >>> import html_text >>> tree = html_text.parse_html('

<h1>Hello</h1> world!')
>>> html_text.extract_text(tree)
'Hello\n\nworld!'

If you want, you can handle cleaning manually; use lower-level ``html_text.etree_to_text`` in this case:

>>> import html_text
>>> tree = html_text.parse_html('<h1>Hello!</h1>
') >>> cleaned_tree = html_text.cleaner.clean_html(tree) >>> html_text.etree_to_text(cleaned_tree) 'Hello!' parsel.Selector objects are also supported; you can define a parsel.Selector to extract text only from specific elements: >>> import html_text >>> sel = html_text.cleaned_selector('

<h1>Hello</h1>
world!') >>> subsel = sel.xpath('//h1') >>> html_text.selector_to_text(subsel) 'Hello' NB parsel.Selector objects are not cleaned automatically, you need to call ``html_text.cleaned_selector`` first. Main functions and objects: * ``html_text.extract_text`` accepts html and returns extracted text. * ``html_text.etree_to_text`` accepts parsed lxml Element and returns extracted text; it is a lower-level function, cleaning is not handled here. * ``html_text.cleaner`` is an ``lxml.html.clean.Cleaner`` instance which can be used with ``html_text.etree_to_text``; its options are tuned for speed and text extraction quality. * ``html_text.cleaned_selector`` accepts html as text or as ``lxml.html.HtmlElement``, and returns cleaned ``parsel.Selector``. * ``html_text.selector_to_text`` accepts ``parsel.Selector`` and returns extracted text. If ``guess_layout`` is True (default), a newline is added before and after ``newline_tags``, and two newlines are added before and after ``double_newline_tags``. This heuristic makes the extracted text more similar to how it is rendered in the browser. Default newline and double newline tags can be found in `html_text.NEWLINE_TAGS` and `html_text.DOUBLE_NEWLINE_TAGS`. It is possible to customize how newlines are added, using ``newline_tags`` and ``double_newline_tags`` arguments (which are `html_text.NEWLINE_TAGS` and `html_text.DOUBLE_NEWLINE_TAGS` by default). For example, don't add a newline after ``
<div>`` tags:

>>> newline_tags = html_text.NEWLINE_TAGS - {'div'}
>>> html_text.extract_text('<div>Hello</div>
world!', ... newline_tags=newline_tags) 'Hello world!' Apart from just getting text from the page (e.g. for display or search), one intended usage of this library is for machine learning (feature extraction). If you want to use the text of the html page as a feature (e.g. for classification), this library gives you plain text that you can later feed into a standard text classification pipeline. If you feel that you need html structure as well, check out `webstruct `_ library. ---- .. image:: https://hyperiongray.s3.amazonaws.com/define-hg.svg :target: https://www.hyperiongray.com/?pk_campaign=github&pk_kwd=html-text :alt: define hyperiongray html-text-0.5.2/codecov.yml000066400000000000000000000001201370610760700156110ustar00rootroot00000000000000comment: layout: "header, diff, tree" coverage: status: project: false html-text-0.5.2/html_text/000077500000000000000000000000001370610760700154635ustar00rootroot00000000000000html-text-0.5.2/html_text/__init__.py000066400000000000000000000003601370610760700175730ustar00rootroot00000000000000# -*- coding: utf-8 -*- __version__ = '0.5.2' from .html_text import (etree_to_text, extract_text, selector_to_text, parse_html, cleaned_selector, cleaner, NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS) html-text-0.5.2/html_text/html_text.py000066400000000000000000000161721370610760700200540ustar00rootroot00000000000000# -*- coding: utf-8 -*- import re import lxml import lxml.etree from lxml.html.clean import Cleaner NEWLINE_TAGS = frozenset([ 'article', 'aside', 'br', 'dd', 'details', 'div', 'dt', 'fieldset', 'figcaption', 'footer', 'form', 'header', 'hr', 'legend', 'li', 'main', 'nav', 'table', 'tr' ]) DOUBLE_NEWLINE_TAGS = frozenset([ 'blockquote', 'dl', 'figure', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'ol', 'p', 'pre', 'title', 'ul' ]) cleaner = Cleaner( scripts=True, javascript=False, # onclick attributes are fine comments=True, style=True, links=True, meta=True, page_structure=False, # may be nice to have processing_instructions=True, embedded=True, frames=True, forms=False, # keep forms annoying_tags=False, remove_unknown_tags=False, safe_attrs_only=False, ) def _cleaned_html_tree(html): if isinstance(html, lxml.html.HtmlElement): tree = html else: tree = parse_html(html) # we need this as https://bugs.launchpad.net/lxml/+bug/1838497 try: cleaned = cleaner.clean_html(tree) except AssertionError: cleaned = tree return cleaned def parse_html(html): """ Create an lxml.html.HtmlElement from a string with html. XXX: mostly copy-pasted from parsel.selector.create_root_node """ body = html.strip().replace('\x00', '').encode('utf8') or b'<html/>' parser = lxml.html.HTMLParser(recover=True, encoding='utf8') root = lxml.etree.fromstring(body, parser=parser) if root is None: root = lxml.etree.fromstring(b'<html/>', parser=parser) return root _whitespace = re.compile(r'\s+') _has_trailing_whitespace = re.compile(r'\s$').search _has_punct_after = re.compile(r'^[,:;.!?")]').search _has_open_bracket_before = re.compile(r'\($').search def _normalize_whitespace(text): return _whitespace.sub(' ', text.strip()) def etree_to_text(tree, guess_punct_space=True, guess_layout=True, newline_tags=NEWLINE_TAGS, double_newline_tags=DOUBLE_NEWLINE_TAGS): """ Convert a html tree to text. Tree should be cleaned with ``html_text.html_text.cleaner.clean_html`` before passing to this function. See html_text.extract_text docstring for description of the approach and options. 
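    A rough usage sketch (``parse_html`` and ``cleaner`` are defined in this
    module; the markup below is only illustrative):

        >>> tree = parse_html('<h1>Hello</h1> world!')
        >>> etree_to_text(cleaner.clean_html(tree))
        'Hello\n\nworld!'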
""" chunks = [] _NEWLINE = object() _DOUBLE_NEWLINE = object() class Context: """ workaround for missing `nonlocal` in Python 2 """ # _NEWLINE, _DOUBLE_NEWLINE or content of the previous chunk (str) prev = _DOUBLE_NEWLINE def should_add_space(text, prev): """ Return True if extra whitespace should be added before text """ if prev in {_NEWLINE, _DOUBLE_NEWLINE}: return False if not guess_punct_space: return True if not _has_trailing_whitespace(prev): if _has_punct_after(text) or _has_open_bracket_before(prev): return False return True def get_space_between(text, prev): if not text: return ' ' return ' ' if should_add_space(text, prev) else '' def add_newlines(tag, context): if not guess_layout: return prev = context.prev if prev is _DOUBLE_NEWLINE: # don't output more than 1 blank line return if tag in double_newline_tags: context.prev = _DOUBLE_NEWLINE chunks.append('\n' if prev is _NEWLINE else '\n\n') elif tag in newline_tags: context.prev = _NEWLINE if prev is not _NEWLINE: chunks.append('\n') def add_text(text_content, context): text = _normalize_whitespace(text_content) if text_content else '' if not text: return space = get_space_between(text, context.prev) chunks.extend([space, text]) context.prev = text_content def traverse_text_fragments(tree, context, handle_tail=True): """ Extract text from the ``tree``: fill ``chunks`` variable """ add_newlines(tree.tag, context) add_text(tree.text, context) for child in tree: traverse_text_fragments(child, context) add_newlines(tree.tag, context) if handle_tail: add_text(tree.tail, context) traverse_text_fragments(tree, context=Context(), handle_tail=False) return ''.join(chunks).strip() def selector_to_text(sel, guess_punct_space=True, guess_layout=True): """ Convert a cleaned parsel.Selector to text. See html_text.extract_text docstring for description of the approach and options. """ import parsel if isinstance(sel, parsel.SelectorList): # if selecting a specific xpath text = [] for s in sel: extracted = etree_to_text( s.root, guess_punct_space=guess_punct_space, guess_layout=guess_layout) if extracted: text.append(extracted) return ' '.join(text) else: return etree_to_text( sel.root, guess_punct_space=guess_punct_space, guess_layout=guess_layout) def cleaned_selector(html): """ Clean parsel.selector. """ import parsel try: tree = _cleaned_html_tree(html) sel = parsel.Selector(root=tree, type='html') except (lxml.etree.XMLSyntaxError, lxml.etree.ParseError, lxml.etree.ParserError, UnicodeEncodeError): # likely plain text sel = parsel.Selector(html) return sel def extract_text(html, guess_punct_space=True, guess_layout=True, newline_tags=NEWLINE_TAGS, double_newline_tags=DOUBLE_NEWLINE_TAGS): """ Convert html to text, cleaning invisible content such as styles. Almost the same as normalize-space xpath, but this also adds spaces between inline elements (like <span>) which are often used as block elements in html markup, and adds appropriate newlines to make output better formatted. html should be a unicode string or an already parsed lxml.html element. ``html_text.etree_to_text`` is a lower-level function which only accepts an already parsed lxml.html Element, and is not doing html cleaning itself. When guess_punct_space is True (default), no extra whitespace is added for punctuation. This has a slight (around 10%) performance overhead and is just a heuristic. When guess_layout is True (default), a newline is added before and after ``newline_tags`` and two newlines are added before and after ``double_newline_tags``. 
This heuristic makes the extracted text more similar to how it is rendered in the browser. Default newline and double newline tags can be found in `html_text.NEWLINE_TAGS` and `html_text.DOUBLE_NEWLINE_TAGS`. """ if html is None: return '' cleaned = _cleaned_html_tree(html) return etree_to_text( cleaned, guess_punct_space=guess_punct_space, guess_layout=guess_layout, newline_tags=newline_tags, double_newline_tags=double_newline_tags, ) ������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������html-text-0.5.2/setup.cfg���������������������������������������������������������������������������0000664�0000000�0000000�00000000526�13706107607�0015277�0����������������������������������������������������������������������������������������������������ustar�00root����������������������������root����������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������[bumpversion] current_version = 0.5.2 commit = True tag = True tag_name = {new_version} [bumpversion:file:setup.py] search = version='{current_version}' replace = version='{new_version}' [bumpversion:file:html_text/__init__.py] search = __version__ = '{current_version}' replace = __version__ = '{new_version}' [bdist_wheel] universal = 1 ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������html-text-0.5.2/setup.py����������������������������������������������������������������������������0000775�0000000�0000000�00000002317�13706107607�0015173�0����������������������������������������������������������������������������������������������������ustar�00root����������������������������root����������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������#!/usr/bin/env python # -*- coding: utf-8 -*- from setuptools import setup with open('README.rst') as readme_file: readme = readme_file.read() with open('CHANGES.rst') as history_file: history = history_file.read() setup( name='html_text', version='0.5.2', description="Extract text from HTML", long_description=readme + '\n\n' + history, author="Konstantin Lopukhin", author_email='kostia.lopuhin@gmail.com', url='https://github.com/TeamHG-Memex/html-text', packages=['html_text'], include_package_data=True, install_requires=['lxml'], license="MIT license", zip_safe=False, classifiers=[ 'Development Status :: 4 - Beta', 'Intended Audience :: Developers', 'License :: OSI Approved :: MIT License', 'Natural Language :: English', "Programming Language :: Python :: 2", 'Programming Language :: Python :: 2.7', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.5', 'Programming Language :: Python :: 3.6', 'Programming Language :: Python :: 3.7', 'Programming Language :: Python :: 3.8', ], test_suite='tests', tests_require=['pytest'], ) 
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������html-text-0.5.2/tests/������������������������������������������������������������������������������0000775�0000000�0000000�00000000000�13706107607�0014615�5����������������������������������������������������������������������������������������������������ustar�00root����������������������������root����������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������html-text-0.5.2/tests/__init__.py�������������������������������������������������������������������0000664�0000000�0000000�00000000030�13706107607�0016717�0����������������������������������������������������������������������������������������������������ustar�00root����������������������������root����������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# -*- coding: utf-8 -*- ��������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������html-text-0.5.2/tests/test_html_text.py�������������������������������������������������������������0000664�0000000�0000000�00000017322�13706107607�0020243�0����������������������������������������������������������������������������������������������������ustar�00root����������������������������root����������������������������0000000�0000000������������������������������������������������������������������������������������������������������������������������������������������������������������������������# -*- coding: utf-8 -*- import glob import os import six import pytest from html_text import (extract_text, parse_html, cleaned_selector, etree_to_text, cleaner, selector_to_text, NEWLINE_TAGS, DOUBLE_NEWLINE_TAGS) ROOT = os.path.dirname(os.path.abspath(__file__)) @pytest.fixture(params=[ {'guess_punct_space': True, 'guess_layout': False}, {'guess_punct_space': False, 'guess_layout': False}, {'guess_punct_space': True, 'guess_layout': True}, {'guess_punct_space': False, 'guess_layout': True} ]) def all_options(request): return request.param def test_extract_no_text_html(all_options): html = (u'<!DOCTYPE html><html><body><p><video width="320" height="240" ' 'controls><source src="movie.mp4" type="video/mp4"><source ' 'src="movie.ogg" type="video/ogg"></video></p></body></html>') assert extract_text(html, **all_options) == u'' def test_extract_text(all_options): html = (u'<html><style>.div {}</style>' '<body><p>Hello, world!</body></html>') assert extract_text(html, **all_options) == u'Hello, world!' 
def test_declared_encoding(all_options): html = (u'<?xml version="1.0" encoding="utf-8" ?>' u'<html><style>.div {}</style>' u'<body>Hello, world!</p></body></html>') assert extract_text(html, **all_options) == u'Hello, world!' def test_empty(all_options): assert extract_text(u'', **all_options) == '' assert extract_text(u' ', **all_options) == '' assert extract_text(None, **all_options) == '' def test_comment(all_options): assert extract_text(u"<!-- hello world -->", **all_options) == '' def test_extract_text_from_tree(all_options): html = (u'<html><style>.div {}</style>' '<body><p>Hello, world!</body></html>') tree = parse_html(html) assert extract_text(tree, **all_options) == u'Hello, world!' def test_extract_text_from_node(all_options): html = (u'<html><style>.div {}</style>' '<body><p>Hello, world!</p></body></html>') tree = parse_html(html) node = tree.xpath('//p')[0] assert extract_text(node, **all_options) == u'Hello, world!' def test_inline_tags_whitespace(all_options): html = u'<span>field</span><span>value of</span><span></span>' assert extract_text(html, **all_options) == u'field value of' def test_extract_text_from_fail_html(): html = "<html><frameset><frame></frameset></html>" tree = parse_html(html) node = tree.xpath('/html/frameset')[0] assert extract_text(node) == u'' def test_punct_whitespace(): html = u'<div><span>field</span>, and more</div>' assert extract_text(html, guess_punct_space=False) == u'field , and more' assert extract_text(html, guess_punct_space=True) == u'field, and more' def test_punct_whitespace_preserved(): html = (u'<div><span>по</span><span>ле</span>, and , ' u'<span>more </span>!<span>now</div>a (<b>boo</b>)') text = extract_text(html, guess_punct_space=True, guess_layout=False) assert text == u'по ле, and , more ! now a (boo)' @pytest.mark.xfail(reason="code punctuation should be handled differently") def test_bad_punct_whitespace(): html = (u'<pre><span>trees</span> ' '<span>=</span> <span>webstruct</span>' '<span>.</span><span>load_trees</span>' '<span>(</span><span>"train/*.html"</span>' '<span>)</span></pre>') text = extract_text(html, guess_punct_space=False, guess_layout=False) assert text == u'trees = webstruct . load_trees ( "train/*.html" )' text = extract_text(html, guess_punct_space=True, guess_layout=False) assert text == u'trees = webstruct.load_trees("train/*.html")' def test_selectors(all_options): pytest.importorskip("parsel") html = (u'<span><span id="extract-me">text<a>more</a>' '</span>and more text <a> and some more</a> <a></a> </span>') # Selector sel = cleaned_selector(html) assert selector_to_text(sel, **all_options) == 'text more and more text and some more' # SelectorList subsel = sel.xpath('//span[@id="extract-me"]') assert selector_to_text(subsel, **all_options) == 'text more' subsel = sel.xpath('//a') assert selector_to_text(subsel, **all_options) == 'more and some more' subsel = sel.xpath('//a[@id="extract-me"]') assert selector_to_text(subsel, **all_options) == '' subsel = sel.xpath('//foo') assert selector_to_text(subsel, **all_options) == '' def test_nbsp(): if six.PY2: raise pytest.xfail("  produces '\xa0' in Python 2, " "but ' ' in Python 3") html = "<h1>Foo Bar</h1>" assert extract_text(html) == "Foo Bar" def test_guess_layout(): html = (u'<title> title
text_1.

text_2 text_3

' '

  • text_4
  • text_5
' '

text_6text_7text_8

text_9
' '

...text_10

') text = 'title text_1. text_2 text_3 text_4 text_5 text_6 text_7 ' \ 'text_8 text_9 ...text_10' assert extract_text(html, guess_punct_space=False, guess_layout=False) == text text = ('title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5' '\n\ntext_6 text_7 text_8\n\ntext_9\n\n...text_10') assert extract_text(html, guess_punct_space=False, guess_layout=True) == text text = 'title text_1. text_2 text_3 text_4 text_5 text_6 text_7 ' \ 'text_8 text_9...text_10' assert extract_text(html, guess_punct_space=True, guess_layout=False) == text text = 'title\n\ntext_1.\n\ntext_2 text_3\n\ntext_4\ntext_5\n\n' \ 'text_6 text_7 text_8\n\ntext_9\n\n...text_10' assert extract_text(html, guess_punct_space=True, guess_layout=True) == text def test_basic_newline(): html = u'
<div>a</div><div>b</div>
' assert extract_text(html, guess_punct_space=False, guess_layout=False) == 'a b' assert extract_text(html, guess_punct_space=False, guess_layout=True) == 'a\nb' assert extract_text(html, guess_punct_space=True, guess_layout=False) == 'a b' assert extract_text(html, guess_punct_space=True, guess_layout=True) == 'a\nb' def test_adjust_newline(): html = u'
<div>text 1</div><p>text 2</p>
' assert extract_text(html, guess_layout=True) == 'text 1\n\ntext 2' def test_personalize_newlines_sets(): html = (u'textmore' 'and more text and some more ') text = extract_text(html, guess_layout=True, newline_tags=NEWLINE_TAGS | {'a'}) assert text == 'text\nmore\nand more text\nand some more' text = extract_text(html, guess_layout=True, double_newline_tags=DOUBLE_NEWLINE_TAGS | {'a'}) assert text == 'text\n\nmore\n\nand more text\n\nand some more' def _webpage_paths(): webpages = sorted(glob.glob(os.path.join(ROOT, 'test_webpages', '*.html'))) extracted = sorted(glob.glob(os.path.join(ROOT, 'test_webpages','*.txt'))) return list(zip(webpages, extracted)) def _load_file(path): with open(path, 'rb') as f: return f.read().decode('utf8') @pytest.mark.parametrize(['page', 'extracted'], _webpage_paths()) def test_webpages(page, extracted): html = _load_file(page) if not six.PY3: # FIXME:   produces '\xa0' in Python 2, but ' ' in Python 3 # this difference is ignored in this test. # What is the correct behavior? html = html.replace(' ', ' ') expected = _load_file(extracted) assert extract_text(html) == expected tree = cleaner.clean_html(parse_html(html)) assert etree_to_text(tree) == expected html-text-0.5.2/tests/test_webpages/000077500000000000000000000000001370610760700174515ustar00rootroot00000000000000html-text-0.5.2/tests/test_webpages/A Light in the Attic | Books to Scrape - Sandbox.html000066400000000000000000000220771370610760700305230ustar00rootroot00000000000000 A Light in the Attic | Books to Scrape - Sandbox
Books to Scrape We love being scraped!

A Light in the Attic

£51.77

In stock (22 available)

 


Product Description

It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more

Product Information

UPCa897fe39b1053632
Product TypeBooks
Price (excl. tax)£51.77
Price (incl. tax)£51.77
Tax£0.00
Availability In stock (22 available)
Number of reviews 0
html-text-0.5.2/tests/test_webpages/A Light in the Attic | Books to Scrape - Sandbox.txt000066400000000000000000000030031370610760700303620ustar00rootroot00000000000000A Light in the Attic | Books to Scrape - Sandbox Books to Scrape We love being scraped! Home Books Poetry A Light in the Attic A Light in the Attic £51.77 In stock (22 available) Warning! This is a demo website for web scraping purposes. Prices and ratings here were randomly assigned and have no real meaning. Product Description It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounded so good. ...more Product Information UPC a897fe39b1053632 Product Type Books Price (excl. tax) £51.77 Price (incl. tax) £51.77 Tax £0.00 Availability In stock (22 available) Number of reviews 0html-text-0.5.2/tests/test_webpages/IANA — IANA-managed Reserved Domains.html000066400000000000000000000237611370610760700273040ustar00rootroot00000000000000 IANA — IANA-managed Reserved Domains

IANA-managed Reserved Domains

Certain domains are set aside, and nominally registered to “IANA”, for specific policy or technical purposes.

Example domains

As described in RFC 2606 and RFC 6761, a number of domains such as example.com and example.org are maintained for documentation purposes. These domains may be used as illustrative examples in documents without prior coordination with us. They are not available for registration or transfer.

Test IDN top-level domains

These domains were temporarily delegated by IANA for the IDN Evaluation being conducted by ICANN.

DomainDomain (A-label)LanguageScript
إختبارXN--KGBECHTV ArabicArabic
آزمایشیXN--HGBK6AJ7F53BBA PersianArabic
测试XN--0ZWM56D ChineseHan (Simplified variant)
測試XN--G6W251D ChineseHan (Traditional variant)
испытаниеXN--80AKHBYKNJ4F RussianCyrillic
परीक्षाXN--11B5BS3A9AJ6G HindiDevanagari (Nagari)
δοκιμήXN--JXALPDLP Greek, Modern (1453-)Greek
테스트XN--9T4B11YI5A KoreanHangul (Hangŭl, Hangeul)
טעסטXN--DEBA0AD YiddishHebrew
テストXN--ZCKZAH JapaneseKatakana
பரிட்சைXN--HLCJ6AYA9ESC7A TamilTamil

Policy-reserved domains

We act as both the registrant and registrar for a select number of domains which have been reserved under policy grounds. These exclusions are typically indicated in either technical standards (RFC documents), or contractual limitations.

Domains which are described as registered to IANA or ICANN on policy grounds are not available for registration or transfer, with the exception of country-name.info domains. These domains are available for release by the ICANN Governmental Advisory Committee Secretariat.

Other Special-Use Domains

There is additionally a Special-Use Domain Names registry documenting special-use domains designated by technical standards. For further information, see Special-Use Domain Names (RFC 6761).

html-text-0.5.2/tests/test_webpages/IANA — IANA-managed Reserved Domains.txt000066400000000000000000000054441370610760700271550ustar00rootroot00000000000000IANA — IANA-managed Reserved Domains Domains Numbers Protocols About Us IANA-managed Reserved Domains Certain domains are set aside, and nominally registered to “IANA”, for specific policy or technical purposes. Example domains As described in RFC 2606 and RFC 6761, a number of domains such as example.com and example.org are maintained for documentation purposes. These domains may be used as illustrative examples in documents without prior coordination with us. They are not available for registration or transfer. Test IDN top-level domains These domains were temporarily delegated by IANA for the IDN Evaluation being conducted by ICANN. Domain Domain (A-label) Language Script إختبار XN--KGBECHTV Arabic Arabic آزمایشی XN--HGBK6AJ7F53BBA Persian Arabic 测试 XN--0ZWM56D Chinese Han (Simplified variant) 測試 XN--G6W251D Chinese Han (Traditional variant) испытание XN--80AKHBYKNJ4F Russian Cyrillic परीक्षा XN--11B5BS3A9AJ6G Hindi Devanagari (Nagari) δοκιμή XN--JXALPDLP Greek, Modern (1453-) Greek 테스트 XN--9T4B11YI5A Korean Hangul (Hangŭl, Hangeul) טעסט XN--DEBA0AD Yiddish Hebrew テスト XN--ZCKZAH Japanese Katakana பரிட்சை XN--HLCJ6AYA9ESC7A Tamil Tamil Policy-reserved domains We act as both the registrant and registrar for a select number of domains which have been reserved under policy grounds. These exclusions are typically indicated in either technical standards (RFC documents), or contractual limitations. Domains which are described as registered to IANA or ICANN on policy grounds are not available for registration or transfer, with the exception of country-name.info domains. These domains are available for release by the ICANN Governmental Advisory Committee Secretariat. Other Special-Use Domains There is additionally a Special-Use Domain Names registry documenting special-use domains designated by technical standards. For further information, see Special-Use Domain Names (RFC 6761). Domain Names Overview Root Zone Management Overview Root Database Hint and Zone Files Change Requests Instructions & Guides Root Servers .INT Registry Overview Register/modify an .INT domain Eligibility .ARPA Registry IDN Practices Repository Overview Submit a table Root Key Signing Key (DNSSEC) Overview Trusts Anchors and Keys Root KSK Ceremonies Practice Statement Community Representatives Reserved Domains Domain Names Root Zone Registry .INT Registry .ARPA Registry IDN Repository Number Resources Abuse Information Protocols Protocol Registries Time Zone Database About Us Presentations Reports Performance Reviews Excellence Contact Us The IANA functions coordinate the Internet’s globally unique identifiers, and are provided by Public Technical Identifiers, an affiliate of ICANN. Privacy Policy Terms of Servicehtml-text-0.5.2/tests/test_webpages/Scrapinghub Enterprise Solutions.html000066400000000000000000002031151370610760700266670ustar00rootroot00000000000000 Scrapinghub Enterprise Solutions

Web data, hassle-free, for real business needs

Get a free consultation

From the world leading experts in web scraping

Lead generation, competitor & sales intelligence

Alternative data for finance, equity and market research

Dark web, law enforcement & compliance

Staffing, talent sourcing & job market research

Product aggregation & price monitoring for retail, e-commerce & manufacturers

Monitoring of ratings and reviews, sentiment analysis & social network intelligence

Need some advice?

The best web crawler team

Authors of the #1 web crawling framework, the world’s most experienced team of engineers will help you get the very best results for your project.

7 billion pages crawled on our platform
per month

We are the authors of the most popular open-source web scraping tools. You can be assured that our services are the best in class.

100% money-back guarantee

All your scraping projects are backed by us. Maintenance agreements and enterprise SLAs available to ensure long-term success.

We scrape the web for:

html-text-0.5.2/tests/test_webpages/Scrapinghub Enterprise Solutions.txt000066400000000000000000000110331370610760700265360ustar00rootroot00000000000000Scrapinghub Enterprise Solutions Enterprise Solutions Products Data on Demand Turn web content into useful data for your business Crawlera Smart Proxy A smart proxy that never gets banned and doesn't need IP rotation Professional Services The most experienced team from the market leaders in web scraping Scrapy Cloud Deploy and manage your Scrapy spiders with your web scraping team Scrapy Training Get your team trained on Scrapy, by the team that create Scrapy itself Developer Tools Scrapy Cloud Deploy and manage your Scrapy spiders with your web scraping team Crawlera Smart Proxy A smart proxy that never gets banned and doesn't need IP rotation Splash A full blown browser behind an API, to render pages and execute actions Pricing Data on Demand Crawlera Scrapy Cloud Splash Sign In hamburger Enterprise Solutions Products Data on Demand Turn web content into useful data for your business Crawlera Smart Proxy A smart proxy that never gets banned and doesn't need IP rotation Professional Services The most experienced team from the market leaders in web scraping Scrapy Cloud Deploy and manage your Scrapy spiders with your web scraping team Scrapy Training Get your team trained on Scrapy, by the team that create Scrapy itself Developer Tools Scrapy Cloud Deploy and manage your Scrapy spiders with your web scraping team Crawlera Smart Proxy A smart proxy that never gets banned and doesn't need IP rotation Splash A full blown browser behind an API, to render pages and execute actions Pricing Data on Demand Crawlera Scrapy Cloud Splash Sign In Enterprise Solutions Complete web scraping services for any size business, from startups to Fortune 100’s Tell us about your project Web data, hassle-free, for real business needs Get a free consultation From the world leading experts in web scraping Lead generation, competitor & sales intelligence Alternative data for finance, equity and market research Dark web, law enforcement & compliance Staffing, talent sourcing & job market research Product aggregation & price monitoring for retail, e-commerce & manufacturers Monitoring of ratings and reviews, sentiment analysis & social network intelligence Let’s Partner Team up with the best web scraping engineers while you stay focused on your business goals Quality assurance, enterprise service-level agreements and maintenance plans Full access to your project’s code with training and handover Money-back guarantee for your project Get in touch Data on Demand Any size scraping project. Data refreshed regularly, reliably and in the form you want Accuracy and coverage guarantees Scraped data from virtually any number of web pages Post processing and automated data crawling updates anytime Get in touch Data Science Enriched data for your business that goes beyond traditional web crawling needs Your raw web data post-processed for real insights Link data across disparate scraped pages Deduce sentiment on a large scale Get in touch Training Learn from the recognised experts in data crawling and scraping to grow your own in-house team One-to-one and group training Standard introduction to web scraping Tailored courses to help you solve very specific business challenges Get in touch Need some advice? The best web crawler team Authors of the #1 web crawling framework, the world’s most experienced team of engineers will help you get the very best results for your project. 
7 billion pages crawled on our platform per month We are the authors of the most popular open-source web scraping tools. You can be assured that our services are the best in class. 100% money-back guarantee All your scraping projects are backed by us. Maintenance agreements and enterprise SLAs available to ensure long-term success. Ask any question We scrape the web for: Need web data? Contact us scrapinghub-letter-logo Cuil Greine House Ballincollig Commercial Park, Link Road Ballincollig, Co. Cork, Ireland VAT Number IE 9787078K Follow us Company About us Clients Open Source Contact Jobs Press Products Data on Demand Proxy Network Professional Services Scrapy Training Developers Scrapy Cloud Crawlera Splash Resources Webinars Blog Documentation Support & KB Status Terms of Service Abuse Report Privacy Policy Cookie Policy © 2010-2017 Scrapinghub Scrapinghub uses cookies to enhance your experience, analyze our website traffic, and share information with our analytics partners. By using this website you consent to our use of cookies. For more information, please refer to our Cookie Policy. I Agreehtml-text-0.5.2/tests/test_webpages/Tutorial — Webstruct 0.6 documentation.html000066400000000000000000001125371370610760700302620ustar00rootroot00000000000000 Tutorial — Webstruct 0.6 documentation

Tutorial

This tutorial assumes you are familiar with machine learning.

Get annotated data

First, you need the training/development data. We suggest to use WebAnnotator Firefox extension to annotate HTML pages.

Recommended WebAnnotator options:

_images/wa-options.png

Pro tip - enable WebAnnotator toolbar buttons:

_images/wa-buttons.png

Follow WebAnnotator manual to define named entities and annotate some web pages (nested WebAnnotator entities are not supported). Use “Save as..” menu item or “Save as” toolbar button to save the results; don’t use “Export as”.

After that you can load annotated webpages as lxml trees:

import webstruct
trees = webstruct.load_trees("train/*.html", webstruct.WebAnnotatorLoader())

See HTML Loaders for more info. GATE annotation format is also supported.

From HTML to Tokens

To convert HTML trees to a format suitable for sequence prediction algorithm (like CRF, MEMM or Structured Perceptron) the following approach is used:

  1. Text is extracted from HTML and split into tokens.
  2. For each token a special HtmlToken instance is created. It contains information not only about the text token itself, but also about its position in HTML tree.

A single HTML page corresponds to a single input sequence (a list of HtmlTokens). For training/testing data (where webpages are already annotated) there is also a list of labels for each webpage, a label per HtmlToken.

To transform HTML trees into labels and HTML tokens use HtmlTokenizer.

html_tokenizer = webstruct.HtmlTokenizer()
X, y = html_tokenizer.tokenize(trees)

Input trees should be loaded by one of the WebStruct loaders. For consistency, for each tree (even if it is loaded from raw unannotated html) HtmlTokenizer extracts two arrays: a list of HtmlToken instances and a list of tags encoded using IOB2 encoding (also known as BIO encoding). So in our example X is a list of lists of HtmlToken instances, and y is a list of lists of strings.

Feature Extraction

For supervised machine learning algorithms to work we need to extract features.

In WebStruct feature vectors are Python dicts {"feature_name": "feature_value"}; a dict is computed for each HTML token. How to convert these dicts into representation required by a sequence labelling toolkit depends on a toolkit used; we will cover that later.

To compute feature dicts we’ll use HtmlFeatureExtractor.

First, define your feature functions. A feature function should take an HtmlToken instance and return a feature dict; feature dicts from individual feature functions will be merged into the final feature dict for a token. Feature functions can ask questions about token itself, its neighbours (in the same HTML element), its position in HTML.

Note

WebStruct supports other kind of feature functions that work on multiple tokens; we don’t cover them in this tutorial.

There are predefined feature functions in webstruct.features, but for this tutorial let’s create some functions ourselves:

def token_identity(html_token):
    return {'token': html_token.token}

def token_isupper(html_token):
    return {'isupper': html_token.token.isupper()}

def parent_tag(html_token):
    return {'parent_tag': html_token.parent.tag}

def border_at_left(html_token):
    return {'border_at_left': html_token.index == 0}

Next, create HtmlFeatureExtractor:

feature_extractor = HtmlFeatureExtractor(
    token_features = [
        token_identity,
        token_isupper,
        parent_tag,
        border_at_left
    ]
)

and use it to extract feature dicts:

features = feature_extractor.fit_transform(X)

See Feature Extraction for more info about HTML tokenization and feature extraction.

Using a Sequence Labelling Toolkit

WebStruct doesn’t provide a CRF or Structured Perceptron implementation; learning and prediction is supposed to be handled by an external sequence labelling toolkit like CRFSuite, Wapiti or seqlearn.

Once feature dicts are extracted from HTML you should convert them to a format required by your sequence labelling tooklit and use this toolkit to train a model and do the prediction. For example, you may use DictVectorizer from scikit-learn to convert feature dicts into seqlearn input format.

We’ll use CRFSuite in this tutorial.

WebStruct provides some helpers for CRFSuite sequence labelling toolkit. To use CRFSuite with WebStruct, you need

  • sklearn-crfsuite package (which depends on python-crfsuite and sklearn)

Defining a Model

Basic way to define CRF model is the following:

model = webstruct.create_crfsuite_pipeline(
        token_features=[token_identity, token_isupper, parent_tag, border_at_left],
        verbose=True
    )

First create_crfsuite_pipeline() argument is a list of feature functions which will be used for training. verbose is a boolean parameter enabling verbose output of various training information; check sklearn-crfsuite API reference for available options.

Under the hood create_crfsuite_pipeline() creates a sklearn.pipeline.Pipeline with an HtmlFeatureExtractor instance followed by sklearn_crfsuite.CRF instance. The example above is just a shortcut for the following:

model = Pipeline([
    ('fe', HtmlFeatureExtractor(
        token_features = [
            token_identity,
            token_isupper,
            parent_tag,
            border_at_left,
        ]
    )),
    ('crf', sklearn_crfsuite.CRF(
        verbose=True
    )),
])

Training

To train a model use its fit method:

model.fit(X, y)

X and y are return values of HtmlTokenizer.tokenize() (a list of lists of HtmlToken instances and a list of lists of string IOB labels).

If you use sklearn_crfsuite.CRF directly then train it using CRF.fit() method. It accepts 2 lists: a list of lists of feature dicts, and a list of lists of tags:

model.fit(features, y)

Named Entity Recognition

Once you got a trained model you can use it to extract entities from unseen (unannotated) webpages. First, get some binary HTML data:

>>> import urllib2
>>> html = urllib2.urlopen("http://scrapinghub.com/contact").read()

Then create a NER instance initialized with a trained model:

>>> ner = webstruct.NER(model)

The model must provide a predict method that extracts features from HTML tokens and predicts labels for these tokens. A pipeline created with create_crfsuite_pipeline() function fits this definition.

Finally, use NER.extract() method to extract entities:

>>> ner.extract(html)
[('Scrapinghub', 'ORG'), ..., ('Iturriaga 3429 ap. 1', 'STREET'), ...]

Generally, the steps are:

  1. Load data using HtmlLoader loader. If a custom HTML cleaner was used for loading training data make sure to apply it here as well.
  2. Use the same html_tokenizer as used for training to extract HTML tokens from loaded trees. All labels would be “O” when using HtmlLoader loader - y can be discarded.
  3. Use the same feature_extractor as used for training to extract features.
  4. Run your_crf.predict() method (e.g. CRF.predict()) on features extracted in (3) to get the prediction - a list of IOB2-encoded tags for each input document.
  5. Build entities from input tokens based on predicted tags (check IobEncoder.group() and smart_join()).
  6. Split entities into groups (optional). One way to do it is to use webstruct.grouping.

NER helper class combines HTML loading, HTML tokenization, feature extraction, CRF model, entity building and grouping.

Entity Grouping

Detecting entities on their own is not always enough; in many cases what is wanted is to find the relationship between them. For example, “street_name/STREET city_name/CITY zipcode_number/ZIPCODE form an address”, or “phone/TEL is a phone of person/PER”.

The first approximation is to say that all entities from a single webpage are related. For example, if we have extracted some organizaion/ORG and some phone/TEL from a single webpage we may assume that the phone is a contact phone of the organization.

Sometimes there are several “entity groups” on a webpage. If a page contains contact phones of several persons or several business locations it is better to split all entities into groups of related entities - “person name + his/her phone(s)” or “address”.

WebStruct provides an unsupervised algorithm for extracting such entity groups. Algorithm prefers to build large groups without entities of duplicate types; if a split is needed algorithm tries to split at points where distance between entities is larger.

Use NER.extract_groups() to extract groups of entities:

>>> ner.extract_groups(html)
[[...], ... [('Iturriaga 3429 ap. 1', 'STREET'), ('Montevideo', 'CITY'), ...]]

Sometimes it is better to allow some entity types to appear multuple times in a group. For example, a person (PER entity) may have several contact phones and faxes (TEL and FAX entities) - we should penalize groups with multiple PERs, but multiple TELs and FAXes are fine. Use dont_penalize argument if you want to allow some entity types to appear multiple times in a group:

ner.extract_groups(html, dont_penalize={'TEL', 'FAX'})

The simple algorithm WebStruct provides is by no means a general solution to relation detection, but give it a try - maybe it is enough for your task.

Model Development

To develop the model you need to choose the learning algorithm, features, hyperparameters, etc. To do that you need scoring metrics, cross-validation utilities and tools for debugging what classifier learned. WebStruct helps in the following way:

  1. Pipeline created by create_crfsuite_pipeline() is compatible with cross-validation and grid search utilities from scikit-learn; use them to select model parameters and check the quality.

    One limitation of create_crfsuite_pipeline() is that n_jobs in scikit-learn functions and classes should be 1, but other than that WebStruct objects should work fine with scikit-learn. Just keep in mind that for WebStruct an “observation” is a document, not an individual token, and a “label” is a sequence of labels for a document, not an individual IOB tag.

  2. There is webstruct.metrics module with a couple of metrics useful for sequence classification.

To debug what CRFSuite learned you could use eli5 library. With eli5 it would be two calls to eli5.explain_weights() and eli5.format_as_html() with sklearn_crfsuite.CRF instance as argument. As a result you will get transitions and feature weights.

Read the Docs v: latest
Versions
latest
stable
0.6
0.5
0.4.1
0.4
0.3
0.2
Downloads
pdf
htmlzip
epub
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.
html-text-0.5.2/tests/test_webpages/Tutorial — Webstruct 0.6 documentation.txt000066400000000000000000000251501370610760700301270ustar00rootroot00000000000000Tutorial — Webstruct 0.6 documentation Webstruct latest Webstruct Tutorial Get annotated data From HTML to Tokens Feature Extraction Using a Sequence Labelling Toolkit Defining a Model Training Named Entity Recognition Entity Grouping Model Development Reference Changes Webstruct Docs » Tutorial Edit on GitHub Tutorial ¶ This tutorial assumes you are familiar with machine learning. Get annotated data ¶ First, you need the training/development data. We suggest to use WebAnnotator Firefox extension to annotate HTML pages. Recommended WebAnnotator options: Pro tip - enable WebAnnotator toolbar buttons: Follow WebAnnotator manual to define named entities and annotate some web pages (nested WebAnnotator entities are not supported). Use “Save as..” menu item or “Save as” toolbar button to save the results; don’t use “Export as”. After that you can load annotated webpages as lxml trees: import webstruct trees = webstruct. load_trees ("train/*.html", webstruct. WebAnnotatorLoader ()) See HTML Loaders for more info. GATE annotation format is also supported. From HTML to Tokens ¶ To convert HTML trees to a format suitable for sequence prediction algorithm (like CRF, MEMM or Structured Perceptron) the following approach is used: Text is extracted from HTML and split into tokens. For each token a special HtmlToken instance is created. It contains information not only about the text token itself, but also about its position in HTML tree. A single HTML page corresponds to a single input sequence (a list of HtmlTokens). For training/testing data (where webpages are already annotated) there is also a list of labels for each webpage, a label per HtmlToken. To transform HTML trees into labels and HTML tokens use HtmlTokenizer. html_tokenizer = webstruct. HtmlTokenizer () X, y = html_tokenizer. tokenize (trees) Input trees should be loaded by one of the WebStruct loaders. For consistency, for each tree (even if it is loaded from raw unannotated html) HtmlTokenizer extracts two arrays: a list of HtmlToken instances and a list of tags encoded using IOB2 encoding (also known as BIO encoding). So in our example X is a list of lists of HtmlToken instances, and y is a list of lists of strings. Feature Extraction ¶ For supervised machine learning algorithms to work we need to extract features. In WebStruct feature vectors are Python dicts {"feature_name":"feature_value"}; a dict is computed for each HTML token. How to convert these dicts into representation required by a sequence labelling toolkit depends on a toolkit used; we will cover that later. To compute feature dicts we’ll use HtmlFeatureExtractor. First, define your feature functions. A feature function should take an HtmlToken instance and return a feature dict; feature dicts from individual feature functions will be merged into the final feature dict for a token. Feature functions can ask questions about token itself, its neighbours (in the same HTML element), its position in HTML. Note WebStruct supports other kind of feature functions that work on multiple tokens; we don’t cover them in this tutorial. There are predefined feature functions in webstruct.features, but for this tutorial let’s create some functions ourselves: def token_identity (html_token): return { 'token': html_token. token } def token_isupper (html_token): return { 'isupper': html_token. token. 
isupper ()} def parent_tag (html_token): return { 'parent_tag': html_token. parent. tag } def border_at_left (html_token): return { 'border_at_left': html_token. index == 0 } Next, create HtmlFeatureExtractor: feature_extractor = HtmlFeatureExtractor (token_features = [ token_identity, token_isupper, parent_tag, border_at_left ]) and use it to extract feature dicts: features = feature_extractor. fit_transform (X) See Feature Extraction for more info about HTML tokenization and feature extraction. Using a Sequence Labelling Toolkit ¶ WebStruct doesn’t provide a CRF or Structured Perceptron implementation; learning and prediction is supposed to be handled by an external sequence labelling toolkit like CRFSuite, Wapiti or seqlearn. Once feature dicts are extracted from HTML you should convert them to a format required by your sequence labelling tooklit and use this toolkit to train a model and do the prediction. For example, you may use DictVectorizer from scikit-learn to convert feature dicts into seqlearn input format. We’ll use CRFSuite in this tutorial. WebStruct provides some helpers for CRFSuite sequence labelling toolkit. To use CRFSuite with WebStruct, you need sklearn-crfsuite package (which depends on python-crfsuite and sklearn) Defining a Model ¶ Basic way to define CRF model is the following: model = webstruct. create_crfsuite_pipeline (token_features = [ token_identity, token_isupper, parent_tag, border_at_left ], verbose = True) First create_crfsuite_pipeline() argument is a list of feature functions which will be used for training. verbose is a boolean parameter enabling verbose output of various training information; check sklearn-crfsuite API reference for available options. Under the hood create_crfsuite_pipeline() creates a sklearn.pipeline.Pipeline with an HtmlFeatureExtractor instance followed by sklearn_crfsuite.CRF instance. The example above is just a shortcut for the following: model = Pipeline ([ ('fe', HtmlFeatureExtractor (token_features = [ token_identity, token_isupper, parent_tag, border_at_left, ])), ('crf', sklearn_crfsuite. CRF (verbose = True)), ]) Training ¶ To train a model use its fit method: model. fit (X, y) X and y are return values of HtmlTokenizer.tokenize() (a list of lists of HtmlToken instances and a list of lists of string IOB labels). If you use sklearn_crfsuite.CRF directly then train it using CRF.fit() method. It accepts 2 lists: a list of lists of feature dicts, and a list of lists of tags: model. fit (features, y) Named Entity Recognition ¶ Once you got a trained model you can use it to extract entities from unseen (unannotated) webpages. First, get some binary HTML data: >>> import urllib2 >>> html = urllib2. urlopen ("http://scrapinghub.com/contact"). read () Then create a NER instance initialized with a trained model: >>> ner = webstruct. NER (model) The model must provide a predict method that extracts features from HTML tokens and predicts labels for these tokens. A pipeline created with create_crfsuite_pipeline() function fits this definition. Finally, use NER.extract() method to extract entities: >>> ner. extract (html) [('Scrapinghub', 'ORG'), ..., ('Iturriaga 3429 ap. 1', 'STREET'), ...] Generally, the steps are: Load data using HtmlLoader loader. If a custom HTML cleaner was used for loading training data make sure to apply it here as well. Use the same html_tokenizer as used for training to extract HTML tokens from loaded trees. All labels would be “O” when using HtmlLoader loader - y can be discarded. 
Use the same feature_extractor as used for training to extract features. Run your_crf.predict() method (e.g. CRF.predict()) on features extracted in (3) to get the prediction - a list of IOB2-encoded tags for each input document. Build entities from input tokens based on predicted tags (check IobEncoder.group() and smart_join()). Split entities into groups (optional). One way to do it is to use webstruct.grouping. NER helper class combines HTML loading, HTML tokenization, feature extraction, CRF model, entity building and grouping. Entity Grouping ¶ Detecting entities on their own is not always enough; in many cases what is wanted is to find the relationship between them. For example, “ street_name/STREET city_name/CITY zipcode_number/ZIPCODE form an address”, or “ phone/TEL is a phone of person/PER ”. The first approximation is to say that all entities from a single webpage are related. For example, if we have extracted some organizaion/ORG and some phone/TEL from a single webpage we may assume that the phone is a contact phone of the organization. Sometimes there are several “entity groups” on a webpage. If a page contains contact phones of several persons or several business locations it is better to split all entities into groups of related entities - “person name + his/her phone(s)” or “address”. WebStruct provides an unsupervised algorithm for extracting such entity groups. Algorithm prefers to build large groups without entities of duplicate types; if a split is needed algorithm tries to split at points where distance between entities is larger. Use NER.extract_groups() to extract groups of entities: >>> ner. extract_groups (html) [[...], ... [('Iturriaga 3429 ap. 1', 'STREET'), ('Montevideo', 'CITY'), ...]] Sometimes it is better to allow some entity types to appear multuple times in a group. For example, a person (PER entity) may have several contact phones and faxes (TEL and FAX entities) - we should penalize groups with multiple PERs, but multiple TELs and FAXes are fine. Use dont_penalize argument if you want to allow some entity types to appear multiple times in a group: ner. extract_groups (html, dont_penalize = { 'TEL', 'FAX' }) The simple algorithm WebStruct provides is by no means a general solution to relation detection, but give it a try - maybe it is enough for your task. Model Development ¶ To develop the model you need to choose the learning algorithm, features, hyperparameters, etc. To do that you need scoring metrics, cross-validation utilities and tools for debugging what classifier learned. WebStruct helps in the following way: Pipeline created by create_crfsuite_pipeline() is compatible with cross-validation and grid search utilities from scikit-learn; use them to select model parameters and check the quality. One limitation of create_crfsuite_pipeline() is that n_jobs in scikit-learn functions and classes should be 1, but other than that WebStruct objects should work fine with scikit-learn. Just keep in mind that for WebStruct an “observation” is a document, not an individual token, and a “label” is a sequence of labels for a document, not an individual IOB tag. There is webstruct.metrics module with a couple of metrics useful for sequence classification. To debug what CRFSuite learned you could use eli5 library. With eli5 it would be two calls to eli5.explain_weights() and eli5.format_as_html() with sklearn_crfsuite.CRF instance as argument. As a result you will get transitions and feature weights. Next Previous © Copyright 2014-2017, Scrapinghub Inc.. 
Revision 9e461566. Built with Sphinx using a theme provided by Read the Docs. Read the Docs v: latest Versions latest stable 0.6 0.5 0.4.1 0.4 0.3 0.2 Downloads pdf htmlzip epub On Read the Docs Project Home Builds Free document hosting provided by Read the Docs.html-text-0.5.2/tests/test_webpages/Webstruct — Webstruct 0.6 documentation.html000066400000000000000000000266561370610760700304470ustar00rootroot00000000000000 Webstruct — Webstruct 0.6 documentation

Read the Docs v: latest
Versions
latest
stable
0.6
0.5
0.4.1
0.4
0.3
0.2
Downloads
pdf
htmlzip
epub
On Read the Docs
Project Home
Builds

Free document hosting provided by Read the Docs.
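
The HTML page above and the ``.txt`` file that follows it form one of the saved real-webpage pairs used by the layout-extraction tests. As a rough, hedged illustration of how such a pair relates to the library, the sketch below re-extracts the text of one saved page with ``html_text.extract_text``; the assumption that the shipped ``.txt`` files correspond to extraction with default options, and the relative fixture path, are made for illustration rather than taken from the actual test code::

    # A minimal sketch, assuming each .html fixture has a sibling .txt file
    # holding the text that html_text.extract_text() yields for it under
    # default options; the pairing convention and the relative path are
    # assumptions made for illustration only.
    import glob
    import io

    import html_text

    html_path = sorted(glob.glob('tests/test_webpages/*.html'))[0]
    with io.open(html_path, encoding='utf-8') as f:
        text = html_text.extract_text(f.read())  # guess_layout=True by default
    print(text[:200])  # eyeball the layout: newlines after headings and paragraphs
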
html-text-0.5.2/tests/test_webpages/Webstruct — Webstruct 0.6 documentation.txt000066400000000000000000000023141370610760700303030ustar00rootroot00000000000000Webstruct — Webstruct 0.6 documentation Webstruct latest Webstruct Tutorial Reference Changes Webstruct Docs » Webstruct Edit on GitHub Webstruct ¶ Webstruct is a library for creating statistical NER systems that work on HTML data, i.e. a library for building tools that extract named entities (addresses, organization names, open hours, etc) from webpages. Contents: Webstruct Overview Installation Tutorial Get annotated data From HTML to Tokens Feature Extraction Using a Sequence Labelling Toolkit Named Entity Recognition Entity Grouping Model Development Reference HTML Loaders Feature Extraction Model Creation Helpers Metrics Entity Grouping Wapiti Helpers CRFsuite Helpers WebAnnotator Utilities BaseSequenceClassifier Miscellaneous Changes 0.6 (2017-12-29) 0.5 (2017-05-10) 0.4.1 (2016-11-28) 0.4 (2016-11-26) 0.3 (2016-09-19) Indices and tables ¶ Index Module Index Search Page Next © Copyright 2014-2017, Scrapinghub Inc.. Revision 9e461566. Built with Sphinx using a theme provided by Read the Docs. Read the Docs v: latest Versions latest stable 0.6 0.5 0.4.1 0.4 0.3 0.2 Downloads pdf htmlzip epub On Read the Docs Project Home Builds Free document hosting provided by Read the Docs.html-text-0.5.2/tox.ini000066400000000000000000000007171370610760700147730ustar00rootroot00000000000000[tox] envlist = py27,py35,py36,py37,py38,{py27,py36}-parsel [testenv] deps = pytest pytest-cov {py27,py36}-parsel: parsel commands = pip install -U pip pip install -e . pytest --cov=html_text --cov-report=html --cov-report=term {env:PYTEST_DOC:} {posargs:.} [testenv:py27-parsel] setenv = PYTEST_DOC = --doctest-modules --doctest-glob='*.rst' [testenv:py36-parsel] setenv = PYTEST_DOC = --doctest-modules --doctest-glob='*.rst'