tag.

2017.10.4
==========
----

* Fix #157: Fix images link with div wrap
* Fix #55: Fix error when empty title tags
* Fix #160: The html2text tests are failing on Windows and on Cygwin due to differences in eol handling between windows/*nix
* Feature #164: Housekeeping: Add flake8 to the travis build, cleanup existing flake8 violations, add py3.6 and pypy3 to the travis build
* Fix #109: Fix for unexpanded ``&lt;`` ``&gt;`` ``&amp;``
* Fix #143: Fix line wrapping for the lines starting with bold
* Adds support for numeric bold text indication in ``font-weight``, as used by Google (and presumably others.)
* Fix #173 and #142: Stripping whitespace in crucial markdown and adding whitespace as necessary
* Don't drop any cell data on tables with uneven row lengths (e.g. colspan in use)

2016.9.19
=========
----

* Default image alt text option created and set to a default of empty string "" to maintain backward compatibility
* Fix #136: ``--default-image-alt`` now takes a string as argument
* Fix #113: Stop changing quiet levels on \/script tags.
* Merge #126: Fix deprecation warning on py3 due to html.escape
* Fix #145: Running test suite on Travis CI for Python 2.6.

2016.5.29
=========
----

* Fix #125: ``--pad-tables`` now pads table cells to make them look nice.
* Fix #114: Break does not interrupt blockquotes
* Deprecation warnings for URL retrieval.

2016.4.2
=========
----

* Fix #106: encoding by stdin
* Fix #89: Python 3.5 support.
* Fix #113: inplace baseurl substitution for ``<a>`` and ``<img>`` tags.
* Feature #118: Update the badges to badge.kloud51.com
* Fix #119: new-line after a list is inserted

2016.1.8
=========
----

* Feature #99: Removed duplicated initialisation.
* Fix #100: Get element style key error.
* Fix #101: Fix error end tag pop exception
* ``<del>``, ``<strike>``, ``<s>`` now rendered as ~~text~~.

2015.11.4
=========
----

* Fix #38: Long links wrapping controlled by ``--no-wrap-links``.
* Note: ``--no-wrap-links`` implies ``--reference-links``
* Feature #83: Add callback-on-tag.
* Fix #87: Decode errors can be handled via command line.
* Feature #95: Docs, decode errors spelling mistake.
* Fix #84: Make bodywidth kwarg overridable using config.

2015.6.21
=========
----

* Fix #31: HTML entities stay inside link.
* Fix #71: Coverage detects command line tests.
* Fix #39: Documentation update.
* Fix #61: Functionality added for optional use of automatic links.
* Feature #80: ``title`` attribute is preserved in both inline and reference links.
* Feature #82: More command line options. See docs.

2015.6.12
=========
----

* Feature #76: Making ``pre`` blocks clearer for further automatic formatting.
* Fix #71: Coverage detects tests carried out in ``subprocesses``

2015.6.6
========
----

* Fix #24: ``3.200.3`` vs ``2014.7.3`` output quirks.
* Fix #61. Malformed links in markdown output.
* Feature #62: Automatic version number.
* Fix #63: Nested code, anchor bug.
* Fix #64: Proper handling of anchors with content that starts with tags.
* Feature #67: Documentation all over the module.
* Feature #70: Adding tests for the module.
* Fix #73: Typo in config documentation.

2015.4.14
=========
----

* Feature #59: Write image tags with height and width attrs as raw html to retain dimensions

2015.4.13
=========
----

* Feature #56: Treat '-' file parameter as stdin.
* Feature #57: Retain escaping of html except within code or pre tags.

2015.2.18
=========
----

* Fix #38: Anchor tags with empty text or with ``<img>`` tags inside are no longer stripped.

2014.12.29
==========
----

* Feature #51: Add single line break option.
  This feature is useful for ensuring that lots of extra line breaks do not
  end up in the resulting Markdown file in situations like Evernote .enex exports.
  Note that this only works properly if ``body-width`` is set to ``0``.

2014.12.24
==========
----

* Feature #49: Added an images_to_alt option to discard images and keep only their alt.
* Feature #50: Protect links, surrounding them with angle brackets to avoid breaking...
* Feature: Add ``setup.cfg`` file.

2014.12.5
=========
----

* Feature: Update ``README.md`` with usage examples.
* Fix #35: Remove ``py_modules`` from ``setup.py``.
* Fix #36: Excludes tests from being installed as a separate module.
* Fix #37: Don't hardcode the path to the installed binary.
* Fix: Readme typo in running cli.
* Feature #40: Extract cli part to ``cli`` module.
* Feature #42: Bring python version compatibility to ``compat.py`` module.
* Feature #41: Extract utility/helper methods to ``utils`` module.
* Fix #45: Does not accept standard input when running under Python 3.
* Feature: Clean up ``ChangeLog.rst`` for version and date numbers.

2014.9.25
=========
----

* Feature #29, #27: Add simple table support with bypass option.
* Fix #20: Replace project website with: https://alir3z4.github.io/html2text/ .

2014.9.8
========
----

* Fix #28: missing ``html2text`` package in installation.

2014.9.7
========
----

* Fix ``unicode``/``type`` error in memory leak unit-test.
* Feature #16: Remove ``install_deps.py``.
* Feature #17: Add status badges via pypin.
* Feature #18: Add ``Python`` ``3.4`` to travis config file.
* Feature #19: Bring ``html2text`` to a separate module and take out the ``conf``/``constant`` variables.
* Feature #21: Remove meta vars from ``html2text.py`` file header.
* Fix: Fix TypeError when parsing tags like
. Fixed in #25.

2014.7.3
========
----

* Fix #8: Remove ``How to do a release`` section from README.md.
* Fix #11: Include test directory markdown, html files.
* Fix #13: memory leak in using ``handle`` while keeping the old instance of ``html2text``.

2014.4.5
========
----

* Fix #1: Add ``ChangeLog.rst`` file.
* Fix #2: Add ``AUTHORS.rst`` file.

html2text-2020.1.16/MANIFEST.in 0000664 0001750 0001750 00000000204 13435524347 015122 0 ustar jon jon 0000000 0000000
include COPYING
include README.md
include ChangeLog.rst
include AUTHORS.rst
include tox.ini
recursive-include test *.html *.md *.py

html2text-2020.1.16/PKG-INFO 0000664 0001750 0001750 00000011564 13610070526 014462 0 ustar jon jon 0000000 0000000
Metadata-Version: 2.1
Name: html2text
Version: 2020.1.16
Summary: Turn HTML into equivalent Markdown-structured text.
Home-page: https://github.com/Alir3z4/html2text/
Author: Aaron Swartz
Author-email: me@aaronsw.com
Maintainer: Alireza Savand
Maintainer-email: alireza.savand@gmail.com
License: GNU GPL 3
Description: # html2text

        [](https://travis-ci.org/Alir3z4/html2text)
        [](https://coveralls.io/r/Alir3z4/html2text)
        [](https://pypi.org/project/html2text/)
        [](https://pypi.org/project/html2text/)
        [](https://pypi.org/project/html2text/)
        [](https://pypi.org/project/html2text/)
        [](https://pypi.org/project/html2text/)

        html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

        Usage: `html2text [filename [encoding]]`

        | Option              | Description
        |---------------------|---------------------------------------------------
        | `--version`         | Show program's version number and exit
        | `-h`, `--help`      | Show this help message and exit
        | `--ignore-links`    | Don't include any formatting for links
        | `--escape-all`      | Escape all special characters. Output is less readable, but avoids corner case formatting issues.
        | `--reference-links` | Use reference-style links instead of inline links
        | `--mark-code`       | Mark preformatted and code blocks with [code]...[/code]

        For a complete list of options see the [docs](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md)

        Or you can use it from within `Python`:

        ```
        >>> import html2text
        >>>
        >>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
        **Zed's** dead baby, _Zed's_ dead.
        ```

        Or with some configuration options:

        ```
        >>> import html2text
        >>>
        >>> h = html2text.HTML2Text()
        >>> # Ignore converting links from HTML
        >>> h.ignore_links = True
        >>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
        Hello, world!

        >>> # Don't Ignore links anymore, I like links
        >>> h.ignore_links = False
        >>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
        Hello, [world](https://www.google.com/earth/)!
        ```

        *Originally written by Aaron Swartz. This code is distributed under the GPLv3.*

        ## How to install

        `html2text` is available on PyPI: https://pypi.org/project/html2text/

        ```
        $ pip install html2text
        ```

        ## How to run unit tests

            tox

        To see the coverage results:

            coverage html

        then open the `./htmlcov/index.html` file in your browser.

        ## Documentation

        Documentation lives [here](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md)

Platform: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.5
Description-Content-Type: text/markdown

html2text-2020.1.16/README.md 0000664 0001750 0001750 00000006151 13435524347 014652 0 ustar jon jon 0000000 0000000
# html2text

[](https://travis-ci.org/Alir3z4/html2text)
[](https://coveralls.io/r/Alir3z4/html2text)
[](https://pypi.org/project/html2text/)
[](https://pypi.org/project/html2text/)
[](https://pypi.org/project/html2text/)
[](https://pypi.org/project/html2text/)
[](https://pypi.org/project/html2text/)

html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format).

Usage: `html2text [filename [encoding]]`

| Option              | Description
|---------------------|---------------------------------------------------
| `--version`         | Show program's version number and exit
| `-h`, `--help`      | Show this help message and exit
| `--ignore-links`    | Don't include any formatting for links
| `--escape-all`      | Escape all special characters. Output is less readable, but avoids corner case formatting issues.
| `--reference-links` | Use reference-style links instead of inline links
| `--mark-code`       | Mark preformatted and code blocks with [code]...[/code]

For a complete list of options see the [docs](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md)

Or you can use it from within `Python`:

```
>>> import html2text
>>>
>>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
**Zed's** dead baby, _Zed's_ dead.
```

Or with some configuration options:

```
>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, world!

>>> # Don't Ignore links anymore, I like links
>>> h.ignore_links = False
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, [world](https://www.google.com/earth/)!
```

*Originally written by Aaron Swartz. This code is distributed under the GPLv3.*

## How to install

`html2text` is available on PyPI: https://pypi.org/project/html2text/

```
$ pip install html2text
```

## How to run unit tests

    tox

To see the coverage results:

    coverage html

then open the `./htmlcov/index.html` file in your browser.

## Documentation

Documentation lives [here](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md)

html2text-2020.1.16/html2text/ 0000775 0001750 0001750 00000000000 13610070526 015311 5 ustar jon jon 0000000 0000000
html2text-2020.1.16/html2text/__init__.py 0000664 0001750 0001750 00000102141 13610070441 017415 0 ustar jon jon 0000000 0000000
"""html2text: Turn HTML into equivalent Markdown-structured text."""

import html.entities
import html.parser
import re
import urllib.parse as urlparse
from textwrap import wrap
from typing import Dict, List, Optional, Tuple, Union

from . import config
from .elements import AnchorElement, ListElement
from .typing import OutCallback
from .utils import (
    dumb_css_parser,
    element_style,
    escape_md,
    escape_md_section,
    google_fixed_width_font,
    google_has_height,
    google_list_style,
    google_text_emphasis,
    hn,
    list_numbering_start,
    pad_tables_in_text,
    skipwrap,
    unifiable_n,
)

__version__ = (2020, 1, 16)


# TODO:
# Support decoded entities with UNIFIABLE.


class HTML2Text(html.parser.HTMLParser):
    def __init__(
        self,
        out: Optional[OutCallback] = None,
        baseurl: str = "",
        bodywidth: int = config.BODY_WIDTH,
    ) -> None:
        """
        Input parameters:
            out: possible custom replacement for self.outtextf (which
                 appends lines of text).
            baseurl: base URL of the document we process
        """
        super().__init__(convert_charrefs=False)

        # Config options
        self.split_next_td = False
        self.td_count = 0
        self.table_start = False
        self.unicode_snob = config.UNICODE_SNOB  # covered in cli
        self.escape_snob = config.ESCAPE_SNOB  # covered in cli
        self.links_each_paragraph = config.LINKS_EACH_PARAGRAPH
        self.body_width = bodywidth  # covered in cli
        self.skip_internal_links = config.SKIP_INTERNAL_LINKS  # covered in cli
        self.inline_links = config.INLINE_LINKS  # covered in cli
        self.protect_links = config.PROTECT_LINKS  # covered in cli
        self.google_list_indent = config.GOOGLE_LIST_INDENT  # covered in cli
        self.ignore_links = config.IGNORE_ANCHORS  # covered in cli
        self.ignore_images = config.IGNORE_IMAGES  # covered in cli
        self.images_as_html = config.IMAGES_AS_HTML  # covered in cli
        self.images_to_alt = config.IMAGES_TO_ALT  # covered in cli
        self.images_with_size = config.IMAGES_WITH_SIZE  # covered in cli
        self.ignore_emphasis = config.IGNORE_EMPHASIS  # covered in cli
        self.bypass_tables = config.BYPASS_TABLES  # covered in cli
        self.ignore_tables = config.IGNORE_TABLES  # covered in cli
        self.google_doc = False  # covered in cli
        self.ul_item_mark = "*"  # covered in cli
        self.emphasis_mark = "_"  # covered in cli
        self.strong_mark = "**"
        self.single_line_break = config.SINGLE_LINE_BREAK  # covered in cli
        self.use_automatic_links = config.USE_AUTOMATIC_LINKS  # covered in cli
        self.hide_strikethrough = False  # covered in cli
        self.mark_code = config.MARK_CODE
        self.wrap_list_items = config.WRAP_LIST_ITEMS  # covered in cli
        self.wrap_links = config.WRAP_LINKS  # covered in cli
        self.pad_tables = config.PAD_TABLES  # covered in cli
        self.default_image_alt = config.DEFAULT_IMAGE_ALT  # covered in cli
        self.tag_callback = None
        self.open_quote = config.OPEN_QUOTE  # covered in cli
        self.close_quote = config.CLOSE_QUOTE  # covered in cli

        if out is None:
            self.out = self.outtextf
        else:
            self.out = out

        # empty list to store output characters before they are "joined"
        self.outtextlist = []  # type: List[str]

        self.quiet = 0
        self.p_p = 0  # number of newline character to print before next output
        self.outcount = 0
        self.start = True
        self.space = False
        self.a = []  # type: List[AnchorElement]
        self.astack = []  # type: List[Optional[Dict[str, Optional[str]]]]
        self.maybe_automatic_link = None  # type: Optional[str]
        self.empty_link = False
        self.absolute_url_matcher = re.compile(r"^[a-zA-Z+]+://")
        self.acount = 0
        self.list = []  # type: List[ListElement]
        self.blockquote = 0
        self.pre = False
        self.startpre = False
        self.code = False
        self.quote = False
        self.br_toggle = ""
        self.lastWasNL = False
        self.lastWasList = False
        self.style = 0
        self.style_def = {}  # type: Dict[str, Dict[str, str]]
        self.tag_stack = (
            []
        )  # type: List[Tuple[str, Dict[str, Optional[str]], Dict[str, str]]]
        self.emphasis = 0
        self.drop_white_space = 0
        self.inheader = False
        # Current abbreviation definition
        self.abbr_title = None  # type: Optional[str]
        # Last inner HTML (for abbr being defined)
        self.abbr_data = None  # type: Optional[str]
        # Stack of abbreviations to write later
        self.abbr_list = {}  # type: Dict[str, str]
        self.baseurl = baseurl
        self.stressed = False
        self.preceding_stressed = False
        self.preceding_data = ""
        self.current_tag = ""

        config.UNIFIABLE["nbsp"] = "&nbsp_place_holder;"

    def feed(self, data: str) -> None:
        data = data.replace("</' + 'script>", "</ignore>")
        super().feed(data)

    def handle(self, data: str) -> str:
        self.feed(data)
        self.feed("")
        markdown = self.optwrap(self.finish())
        if self.pad_tables:
            return pad_tables_in_text(markdown)
        else:
            return markdown

    def outtextf(self, s: str) -> None:
        self.outtextlist.append(s)
        if s:
            self.lastWasNL = s[-1] == "\n"

    def finish(self) -> str:
        self.close()

        self.pbr()
        self.o("", force="end")

        outtext = "".join(self.outtextlist)

        if self.unicode_snob:
            nbsp = html.entities.html5["nbsp;"]
        else:
            nbsp = " "
        outtext = outtext.replace("&nbsp_place_holder;", nbsp)

        # Clear self.outtextlist to avoid memory leak of its content to
        # the next handling.
self.outtextlist = [] return outtext def handle_charref(self, c: str) -> None: self.handle_data(self.charref(c), True) def handle_entityref(self, c: str) -> None: ref = self.entityref(c) # ref may be an empty string (e.g. for / markers that should # not contribute to the final output). # self.handle_data cannot handle a zero-length string right after a # stressed tag or mid-text within a stressed tag (text get split and # self.stressed/self.preceding_stressed gets switched after the first # part of that text). if ref: self.handle_data(ref, True) def handle_starttag(self, tag: str, attrs: List[Tuple[str, Optional[str]]]) -> None: self.handle_tag(tag, dict(attrs), start=True) def handle_endtag(self, tag: str) -> None: self.handle_tag(tag, {}, start=False) def previousIndex(self, attrs: Dict[str, Optional[str]]) -> Optional[int]: """ :type attrs: dict :returns: The index of certain set of attributes (of a link) in the self.a list. If the set of attributes is not found, returns None :rtype: int """ if "href" not in attrs: return None match = False for i, a in enumerate(self.a): if "href" in a.attrs and a.attrs["href"] == attrs["href"]: if "title" in a.attrs or "title" in attrs: if ( "title" in a.attrs and "title" in attrs and a.attrs["title"] == attrs["title"] ): match = True else: match = True if match: return i return None def handle_emphasis( self, start: bool, tag_style: Dict[str, str], parent_style: Dict[str, str] ) -> None: """ Handles various text emphases """ tag_emphasis = google_text_emphasis(tag_style) parent_emphasis = google_text_emphasis(parent_style) # handle Google's text emphasis strikethrough = "line-through" in tag_emphasis and self.hide_strikethrough # google and others may mark a font's weight as `bold` or `700` bold = False for bold_marker in config.BOLD_TEXT_STYLE_VALUES: bold = bold_marker in tag_emphasis and bold_marker not in parent_emphasis if bold: break italic = "italic" in tag_emphasis and "italic" not in parent_emphasis fixed = ( 
google_fixed_width_font(tag_style) and not google_fixed_width_font(parent_style) and not self.pre ) if start: # crossed-out text must be handled before other attributes # in order not to output qualifiers unnecessarily if bold or italic or fixed: self.emphasis += 1 if strikethrough: self.quiet += 1 if italic: self.o(self.emphasis_mark) self.drop_white_space += 1 if bold: self.o(self.strong_mark) self.drop_white_space += 1 if fixed: self.o("`") self.drop_white_space += 1 self.code = True else: if bold or italic or fixed: # there must not be whitespace before closing emphasis mark self.emphasis -= 1 self.space = False if fixed: if self.drop_white_space: # empty emphasis, drop it self.drop_white_space -= 1 else: self.o("`") self.code = False if bold: if self.drop_white_space: # empty emphasis, drop it self.drop_white_space -= 1 else: self.o(self.strong_mark) if italic: if self.drop_white_space: # empty emphasis, drop it self.drop_white_space -= 1 else: self.o(self.emphasis_mark) # space is only allowed after *all* emphasis marks if (bold or italic) and not self.emphasis: self.o(" ") if strikethrough: self.quiet -= 1 def handle_tag( self, tag: str, attrs: Dict[str, Optional[str]], start: bool ) -> None: self.current_tag = tag if self.tag_callback is not None: if self.tag_callback(self, tag, attrs, start) is True: return # first thing inside the anchor tag is another tag # that produces some output if ( start and self.maybe_automatic_link is not None and tag not in ["p", "div", "style", "dl", "dt"] and (tag != "img" or self.ignore_images) ): self.o("[") self.maybe_automatic_link = None self.empty_link = False if self.google_doc: # the attrs parameter is empty for a closing tag. in addition, we # need the attributes of the parent nodes in order to get a # complete style description for the current element. we assume # that google docs export well formed html. 
parent_style = {} # type: Dict[str, str] if start: if self.tag_stack: parent_style = self.tag_stack[-1][2] tag_style = element_style(attrs, self.style_def, parent_style) self.tag_stack.append((tag, attrs, tag_style)) else: dummy, attrs, tag_style = ( self.tag_stack.pop() if self.tag_stack else (None, {}, {}) ) if self.tag_stack: parent_style = self.tag_stack[-1][2] if hn(tag): self.p() if start: self.inheader = True self.o(hn(tag) * "#" + " ") else: self.inheader = False return # prevent redundant emphasis marks on headers if tag in ["p", "div"]: if self.google_doc: if start and google_has_height(tag_style): self.p() else: self.soft_br() elif self.astack and tag == "div": pass else: self.p() if tag == "br" and start: if self.blockquote > 0: self.o(" \n> ") else: self.o(" \n") if tag == "hr" and start: self.p() self.o("* * *") self.p() if tag in ["head", "style", "script"]: if start: self.quiet += 1 else: self.quiet -= 1 if tag == "style": if start: self.style += 1 else: self.style -= 1 if tag in ["body"]: self.quiet = 0 # sites like 9rules.com never close
if tag == "blockquote": if start: self.p() self.o("> ", force=True) self.start = True self.blockquote += 1 else: self.blockquote -= 1 self.p() def no_preceding_space(self: HTML2Text) -> bool: return bool( self.preceding_data and re.match(r"[^\s]", self.preceding_data[-1]) ) if tag in ["em", "i", "u"] and not self.ignore_emphasis: if start and no_preceding_space(self): emphasis = " " + self.emphasis_mark else: emphasis = self.emphasis_mark self.o(emphasis) if start: self.stressed = True if tag in ["strong", "b"] and not self.ignore_emphasis: if start and no_preceding_space(self): strong = " " + self.strong_mark else: strong = self.strong_mark self.o(strong) if start: self.stressed = True if tag in ["del", "strike", "s"]: if start and no_preceding_space(self): strike = " ~~" else: strike = "~~" self.o(strike) if start: self.stressed = True if self.google_doc: if not self.inheader: # handle some font attributes, but leave headers clean self.handle_emphasis(start, tag_style, parent_style) if tag in ["kbd", "code", "tt"] and not self.pre: self.o("`") # TODO: `` `this` `` self.code = not self.code if tag == "abbr": if start: self.abbr_title = None self.abbr_data = "" if "title" in attrs: self.abbr_title = attrs["title"] else: if self.abbr_title is not None: assert self.abbr_data is not None self.abbr_list[self.abbr_data] = self.abbr_title self.abbr_title = None self.abbr_data = None if tag == "q": if not self.quote: self.o(self.open_quote) else: self.o(self.close_quote) self.quote = not self.quote def link_url(self: HTML2Text, link: str, title: str = "") -> None: url = urlparse.urljoin(self.baseurl, link) title = ' "{}"'.format(title) if title.strip() else "" self.o("]({url}{title})".format(url=escape_md(url), title=title)) if tag == "a" and not self.ignore_links: if start: if ( "href" in attrs and attrs["href"] is not None and not (self.skip_internal_links and attrs["href"].startswith("#")) ): self.astack.append(attrs) self.maybe_automatic_link = attrs["href"] 
                    self.empty_link = True
                    if self.protect_links:
                        attrs["href"] = "<" + attrs["href"] + ">"
                else:
                    self.astack.append(None)
            else:
                if self.astack:
                    a = self.astack.pop()
                    if self.maybe_automatic_link and not self.empty_link:
                        self.maybe_automatic_link = None
                    elif a:
                        assert a["href"] is not None
                        if self.empty_link:
                            self.o("[")
                            self.empty_link = False
                            self.maybe_automatic_link = None
                        if self.inline_links:
                            title = a.get("title") or ""
                            title = escape_md(title)
                            link_url(self, a["href"], title)
                        else:
                            i = self.previousIndex(a)
                            if i is not None:
                                a_props = self.a[i]
                            else:
                                self.acount += 1
                                a_props = AnchorElement(a, self.acount, self.outcount)
                                self.a.append(a_props)
                            self.o("][" + str(a_props.count) + "]")

        if tag == "img" and start and not self.ignore_images:
            if "src" in attrs:
                assert attrs["src"] is not None
                if not self.images_to_alt:
                    attrs["href"] = attrs["src"]
                alt = attrs.get("alt") or self.default_image_alt

                # If we have images_with_size, write raw html including width,
                # height, and alt attributes
                if self.images_as_html or (
                    self.images_with_size and ("width" in attrs or "height" in attrs)
                ):
                    self.o("<img src='" + attrs["src"] + "' ")
                    if "width" in attrs:
                        assert attrs["width"] is not None
                        self.o("width='" + attrs["width"] + "' ")
                    if "height" in attrs:
                        assert attrs["height"] is not None
                        self.o("height='" + attrs["height"] + "' ")
                    if alt:
                        self.o("alt='" + alt + "' ")
                    self.o("/>")
                    return

                # If we have a link to create, output the start
                if self.maybe_automatic_link is not None:
                    href = self.maybe_automatic_link
                    if (
                        self.images_to_alt
                        and escape_md(alt) == href
                        and self.absolute_url_matcher.match(href)
                    ):
                        self.o("<" + escape_md(alt) + ">")
                        self.empty_link = False
                        return
                    else:
                        self.o("[")
                        self.maybe_automatic_link = None
                        self.empty_link = False

                # If we have images_to_alt, we discard the image itself,
                # considering only the alt text.
                if self.images_to_alt:
                    self.o(escape_md(alt))
                else:
                    self.o("![" + escape_md(alt) + "]")
                    if self.inline_links:
                        href = attrs.get("href") or ""
                        self.o(
                            "(" + escape_md(urlparse.urljoin(self.baseurl, href)) + ")"
                        )
                    else:
                        i = self.previousIndex(attrs)
                        if i is not None:
                            a_props = self.a[i]
                        else:
                            self.acount += 1
                            a_props = AnchorElement(attrs, self.acount, self.outcount)
                            self.a.append(a_props)
                        self.o("[" + str(a_props.count) + "]")

        if tag == "dl" and start:
            self.p()
        if tag == "dt" and not start:
            self.pbr()
        if tag == "dd" and start:
            self.o("    ")
        if tag == "dd" and not start:
            self.pbr()

        if tag in ["ol", "ul"]:
            # Google Docs create sub lists as top level lists
            if not self.list and not self.lastWasList:
                self.p()
            if start:
                if self.google_doc:
                    list_style = google_list_style(tag_style)
                else:
                    list_style = tag
                numbering_start = list_numbering_start(attrs)
                self.list.append(ListElement(list_style, numbering_start))
            else:
                if self.list:
                    self.list.pop()
                    if not self.google_doc and not self.list:
                        self.o("\n")
            self.lastWasList = True
        else:
            self.lastWasList = False

        if tag == "li":
            self.pbr()
            if start:
                if self.list:
                    li = self.list[-1]
                else:
                    li = ListElement("ul", 0)
                if self.google_doc:
                    nest_count = self.google_nest_count(tag_style)
                else:
                    nest_count = len(self.list)
                # TODO: line up <ol><li>s > 9 correctly.
                self.o("  " * nest_count)
                if li.name == "ul":
                    self.o(self.ul_item_mark + " ")
                elif li.name == "ol":
                    li.num += 1
                    self.o(str(li.num) + ". ")
                self.start = True

        if tag in ["table", "tr", "td", "th"]:
            if self.ignore_tables:
                if tag == "tr":
                    if start:
                        pass
                    else:
                        self.soft_br()
                else:
                    pass

            elif self.bypass_tables:
                if start:
                    self.soft_br()
                if tag in ["td", "th"]:
                    if start:
                        self.o("<{}>\n\n".format(tag))
                    else:
                        self.o("\n</{}>".format(tag))
                else:
                    if start:
                        self.o("<{}>".format(tag))
                    else:
                        self.o("</{}>".format(tag))

            else:
                if tag == "table":
                    if start:
                        self.table_start = True
                        if self.pad_tables:
                            self.o("<" + config.TABLE_MARKER_FOR_PAD + ">")
                            self.o("  \n")
                    else:
                        if self.pad_tables:
                            self.o("</" + config.TABLE_MARKER_FOR_PAD + ">")
                            self.o("  \n")
                if tag in ["td", "th"] and start:
                    if self.split_next_td:
                        self.o("| ")
                    self.split_next_td = True

                if tag == "tr" and start:
                    self.td_count = 0
                if tag == "tr" and not start:
                    self.split_next_td = False
                    self.soft_br()
                if tag == "tr" and not start and self.table_start:
                    # Underline table header
                    self.o("|".join(["---"] * self.td_count))
                    self.soft_br()
                    self.table_start = False
                if tag in ["td", "th"] and start:
                    self.td_count += 1

        if tag == "pre":
            if start:
                self.startpre = True
                self.pre = True
            else:
                self.pre = False
                if self.mark_code:
                    self.out("\n[/code]")
            self.p()

    # TODO: Add docstring for these one letter functions
    def pbr(self) -> None:
        "Pretty print has a line break"
        if self.p_p == 0:
            self.p_p = 1

    def p(self) -> None:
        "Set pretty print to 1 or 2 lines"
        self.p_p = 1 if self.single_line_break else 2

    def soft_br(self) -> None:
        "Soft breaks"
        self.pbr()
        self.br_toggle = "  "

    def o(
        self, data: str, puredata: bool = False, force: Union[bool, str] = False
    ) -> None:
        """
        Deal with indentation and whitespace
        """
        if self.abbr_data is not None:
            self.abbr_data += data

        if not self.quiet:
            if self.google_doc:
                # prevent white space immediately after 'begin emphasis'
                # marks ('**' and '_')
                lstripped_data = data.lstrip()
                if self.drop_white_space and not
(self.pre or self.code): data = lstripped_data if lstripped_data != "": self.drop_white_space = 0 if puredata and not self.pre: # This is a very dangerous call ... it could mess up # all handling of when not handled properly # (see entityref) data = re.sub(r"\s+", r" ", data) if data and data[0] == " ": self.space = True data = data[1:] if not data and not force: return if self.startpre: # self.out(" :") #TODO: not output when already one there if not data.startswith("\n") and not data.startswith("\r\n"): #
stuff... data = "\n" + data if self.mark_code: self.out("\n[code]") self.p_p = 0 bq = ">" * self.blockquote if not (force and data and data[0] == ">") and self.blockquote: bq += " " if self.pre: if not self.list: bq += " " # else: list content is already partially indented bq += " " * len(self.list) data = data.replace("\n", "\n" + bq) if self.startpre: self.startpre = False if self.list: # use existing initial indentation data = data.lstrip("\n") if self.start: self.space = False self.p_p = 0 self.start = False if force == "end": # It's the end. self.p_p = 0 self.out("\n") self.space = False if self.p_p: self.out((self.br_toggle + "\n" + bq) * self.p_p) self.space = False self.br_toggle = "" if self.space: if not self.lastWasNL: self.out(" ") self.space = False if self.a and ( (self.p_p == 2 and self.links_each_paragraph) or force == "end" ): if force == "end": self.out("\n") newa = [] for link in self.a: if self.outcount > link.outcount: self.out( " [" + str(link.count) + "]: " + urlparse.urljoin(self.baseurl, link.attrs["href"]) ) if "title" in link.attrs: assert link.attrs["title"] is not None self.out(" (" + link.attrs["title"] + ")") self.out("\n") else: newa.append(link) # Don't need an extra line when nothing was done. if self.a != newa: self.out("\n") self.a = newa if self.abbr_list and force == "end": for abbr, definition in self.abbr_list.items(): self.out(" *[" + abbr + "]: " + definition + "\n") self.p_p = 0 self.out(data) self.outcount += 1 def handle_data(self, data: str, entity_char: bool = False) -> None: if not data: # Data may be empty for some HTML entities. For example, # LEFT-TO-RIGHT MARK. 
return if self.stressed: data = data.strip() self.stressed = False self.preceding_stressed = True elif self.preceding_stressed: if ( re.match(r"[^\s.!?]", data[0]) and not hn(self.current_tag) and self.current_tag not in ["a", "code", "pre"] ): # should match a letter or common punctuation data = " " + data self.preceding_stressed = False if self.style: self.style_def.update(dumb_css_parser(data)) if self.maybe_automatic_link is not None: href = self.maybe_automatic_link if ( href == data and self.absolute_url_matcher.match(href) and self.use_automatic_links ): self.o("<" + data + ">") self.empty_link = False return else: self.o("[") self.maybe_automatic_link = None self.empty_link = False if not self.code and not self.pre and not entity_char: data = escape_md_section(data, snob=self.escape_snob) self.preceding_data = data self.o(data, puredata=True) def charref(self, name: str) -> str: if name[0] in ["x", "X"]: c = int(name[1:], 16) else: c = int(name) if not self.unicode_snob and c in unifiable_n: return unifiable_n[c] else: try: return chr(c) except ValueError: # invalid unicode return "" def entityref(self, c: str) -> str: if not self.unicode_snob and c in config.UNIFIABLE: return config.UNIFIABLE[c] try: ch = html.entities.html5[c + ";"] except KeyError: return "&" + c + ";" return config.UNIFIABLE[c] if c == "nbsp" else ch def google_nest_count(self, style: Dict[str, str]) -> int: """ Calculate the nesting count of google doc lists :type style: dict :rtype: int """ nest_count = 0 if "margin-left" in style: nest_count = int(style["margin-left"][:-2]) // self.google_list_indent return nest_count def optwrap(self, text: str) -> str: """ Wrap all paragraphs in the provided text. :type text: str :rtype: str """ if not self.body_width: return text result = "" newlines = 0 # I cannot think of a better solution for now. 
# To avoid the non-wrap behaviour for entire paras # because of the presence of a link in it if not self.wrap_links: self.inline_links = False for para in text.split("\n"): if len(para) > 0: if not skipwrap(para, self.wrap_links, self.wrap_list_items): indent = "" if para.startswith(" " + self.ul_item_mark): # list item continuation: add a double indent to the # new lines indent = " " elif para.startswith("> "): # blockquote continuation: add the greater than symbol # to the new lines indent = "> " wrapped = wrap( para, self.body_width, break_long_words=False, subsequent_indent=indent, ) result += "\n".join(wrapped) if para.endswith(" "): result += " \n" newlines = 1 elif indent: result += "\n" newlines = 1 else: result += "\n\n" newlines = 2 else: # Warning for the tempted!!! # Be aware that obvious replacement of this with # line.isspace() # DOES NOT work! Explanations are welcome. if not config.RE_SPACE.match(para): result += para + "\n" newlines = 1 else: if newlines < 2: result += "\n" newlines += 1 return result def html2text(html: str, baseurl: str = "", bodywidth: Optional[int] = None) -> str: if bodywidth is None: bodywidth = config.BODY_WIDTH h = HTML2Text(baseurl=baseurl, bodywidth=bodywidth) return h.handle(html) html2text-2020.1.16/html2text/__main__.py 0000664 0001750 0001750 00000000036 13550373562 017413 0 ustar jon jon 0000000 0000000 from .cli import main main() html2text-2020.1.16/html2text/cli.py 0000664 0001750 0001750 00000022040 13550376531 016440 0 ustar jon jon 0000000 0000000 import argparse import sys from . 
import HTML2Text, __version__, config def main() -> None: baseurl = "" class bcolors: HEADER = "\033[95m" OKBLUE = "\033[94m" OKGREEN = "\033[92m" WARNING = "\033[93m" FAIL = "\033[91m" ENDC = "\033[0m" BOLD = "\033[1m" UNDERLINE = "\033[4m" p = argparse.ArgumentParser() p.add_argument( "--default-image-alt", dest="default_image_alt", default=config.DEFAULT_IMAGE_ALT, help="The default alt string for images with missing ones", ) p.add_argument( "--pad-tables", dest="pad_tables", action="store_true", default=config.PAD_TABLES, help="pad the cells to equal column width in tables", ) p.add_argument( "--no-wrap-links", dest="wrap_links", action="store_false", default=config.WRAP_LINKS, help="don't wrap links during conversion", ) p.add_argument( "--wrap-list-items", dest="wrap_list_items", action="store_true", default=config.WRAP_LIST_ITEMS, help="wrap list items during conversion", ) p.add_argument( "--ignore-emphasis", dest="ignore_emphasis", action="store_true", default=config.IGNORE_EMPHASIS, help="don't include any formatting for emphasis", ) p.add_argument( "--reference-links", dest="inline_links", action="store_false", default=config.INLINE_LINKS, help="use reference style links instead of inline links", ) p.add_argument( "--ignore-links", dest="ignore_links", action="store_true", default=config.IGNORE_ANCHORS, help="don't include any formatting for links", ) p.add_argument( "--protect-links", dest="protect_links", action="store_true", default=config.PROTECT_LINKS, help="protect links from line breaks surrounding them with angle brackets", ) p.add_argument( "--ignore-images", dest="ignore_images", action="store_true", default=config.IGNORE_IMAGES, help="don't include any formatting for images", ) p.add_argument( "--images-as-html", dest="images_as_html", action="store_true", default=config.IMAGES_AS_HTML, help=( "Always write image tags as raw html; preserves `height`, `width` and " "`alt` if possible." 
), ) p.add_argument( "--images-to-alt", dest="images_to_alt", action="store_true", default=config.IMAGES_TO_ALT, help="Discard image data, only keep alt text", ) p.add_argument( "--images-with-size", dest="images_with_size", action="store_true", default=config.IMAGES_WITH_SIZE, help=( "Write image tags with height and width attrs as raw html to retain " "dimensions" ), ) p.add_argument( "-g", "--google-doc", action="store_true", dest="google_doc", default=False, help="convert an html-exported Google Document", ) p.add_argument( "-d", "--dash-unordered-list", action="store_true", dest="ul_style_dash", default=False, help="use a dash rather than a star for unordered list items", ) p.add_argument( "-e", "--asterisk-emphasis", action="store_true", dest="em_style_asterisk", default=False, help="use an asterisk rather than an underscore for emphasized text", ) p.add_argument( "-b", "--body-width", dest="body_width", type=int, default=config.BODY_WIDTH, help="number of characters per output line, 0 for no wrap", ) p.add_argument( "-i", "--google-list-indent", dest="list_indent", type=int, default=config.GOOGLE_LIST_INDENT, help="number of pixels Google indents nested lists", ) p.add_argument( "-s", "--hide-strikethrough", action="store_true", dest="hide_strikethrough", default=False, help="hide strike-through text. only relevant when -g is " "specified as well", ) p.add_argument( "--escape-all", action="store_true", dest="escape_snob", default=False, help=( "Escape all special characters. Output is less readable, but avoids " "corner case formatting issues." 
), ) p.add_argument( "--bypass-tables", action="store_true", dest="bypass_tables", default=config.BYPASS_TABLES, help="Format tables in HTML rather than Markdown syntax.", ) p.add_argument( "--ignore-tables", action="store_true", dest="ignore_tables", default=config.IGNORE_TABLES, help="Ignore table-related tags (table, th, td, tr) " "while keeping rows.", ) p.add_argument( "--single-line-break", action="store_true", dest="single_line_break", default=config.SINGLE_LINE_BREAK, help=( "Use a single line break after a block element rather than two line " "breaks. NOTE: Requires --body-width=0" ), ) p.add_argument( "--unicode-snob", action="store_true", dest="unicode_snob", default=config.UNICODE_SNOB, help="Use unicode throughout document", ) p.add_argument( "--no-automatic-links", action="store_false", dest="use_automatic_links", default=config.USE_AUTOMATIC_LINKS, help="Do not use automatic links wherever applicable", ) p.add_argument( "--no-skip-internal-links", action="store_false", dest="skip_internal_links", default=config.SKIP_INTERNAL_LINKS, help="Do not skip internal links", ) p.add_argument( "--links-after-para", action="store_true", dest="links_each_paragraph", default=config.LINKS_EACH_PARAGRAPH, help="Put links after each paragraph instead of document", ) p.add_argument( "--mark-code", action="store_true", dest="mark_code", default=config.MARK_CODE, help="Mark program code blocks with [code]...[/code]", ) p.add_argument( "--decode-errors", dest="decode_errors", default=config.DECODE_ERRORS, help=( "What to do in case of decode errors.'ignore', 'strict' and 'replace' are " "acceptable values" ), ) p.add_argument( "--open-quote", dest="open_quote", default=config.OPEN_QUOTE, help="The character used to open quotes", ) p.add_argument( "--close-quote", dest="close_quote", default=config.CLOSE_QUOTE, help="The character used to close quotes", ) p.add_argument( "--version", action="version", version=".".join(map(str, __version__)) ) p.add_argument("filename", 
nargs="?") p.add_argument("encoding", nargs="?", default="utf-8") args = p.parse_args() if args.filename and args.filename != "-": with open(args.filename, "rb") as fp: data = fp.read() else: data = sys.stdin.buffer.read() try: html = data.decode(args.encoding, args.decode_errors) except UnicodeDecodeError as err: warning = bcolors.WARNING + "Warning:" + bcolors.ENDC warning += " Use the " + bcolors.OKGREEN warning += "--decode-errors=ignore" + bcolors.ENDC + " flag." print(warning) raise err h = HTML2Text(baseurl=baseurl) # handle options if args.ul_style_dash: h.ul_item_mark = "-" if args.em_style_asterisk: h.emphasis_mark = "*" h.strong_mark = "__" h.body_width = args.body_width h.google_list_indent = args.list_indent h.ignore_emphasis = args.ignore_emphasis h.ignore_links = args.ignore_links h.protect_links = args.protect_links h.ignore_images = args.ignore_images h.images_as_html = args.images_as_html h.images_to_alt = args.images_to_alt h.images_with_size = args.images_with_size h.google_doc = args.google_doc h.hide_strikethrough = args.hide_strikethrough h.escape_snob = args.escape_snob h.bypass_tables = args.bypass_tables h.ignore_tables = args.ignore_tables h.single_line_break = args.single_line_break h.inline_links = args.inline_links h.unicode_snob = args.unicode_snob h.use_automatic_links = args.use_automatic_links h.skip_internal_links = args.skip_internal_links h.links_each_paragraph = args.links_each_paragraph h.mark_code = args.mark_code h.wrap_links = args.wrap_links h.wrap_list_items = args.wrap_list_items h.pad_tables = args.pad_tables h.default_image_alt = args.default_image_alt h.open_quote = args.open_quote h.close_quote = args.close_quote sys.stdout.write(h.handle(html)) html2text-2020.1.16/html2text/config.py 0000664 0001750 0001750 00000007401 13525235166 017142 0 ustar jon jon 0000000 0000000 import re # Use Unicode characters instead of their ascii pseudo-replacements UNICODE_SNOB = False # Marker to use for marking tables for padding 
post processing
TABLE_MARKER_FOR_PAD = "special_marker_for_table_padding"
# Escape all special characters. Output is less readable, but avoids
# corner case formatting issues.
ESCAPE_SNOB = False

# Put the links after each paragraph instead of at the end.
LINKS_EACH_PARAGRAPH = False

# Wrap long lines at position. 0 for no wrapping.
BODY_WIDTH = 78

# Don't show internal links (href="#local-anchor") -- corresponding link
# targets won't be visible in the plain text file anyway.
SKIP_INTERNAL_LINKS = True

# Use inline, rather than reference, formatting for images and links
INLINE_LINKS = True

# Protect links from line breaks surrounding them with angle brackets (in
# addition to their square brackets)
PROTECT_LINKS = False

# Wrap links during conversion.
WRAP_LINKS = True

# Wrap list items.
WRAP_LIST_ITEMS = False

# Number of pixels Google indents nested lists
GOOGLE_LIST_INDENT = 36

# Values Google and others may use to indicate bold text
BOLD_TEXT_STYLE_VALUES = ("bold", "700", "800", "900")

IGNORE_ANCHORS = False
IGNORE_IMAGES = False
IMAGES_AS_HTML = False
IMAGES_TO_ALT = False
IMAGES_WITH_SIZE = False
IGNORE_EMPHASIS = False
MARK_CODE = False
DECODE_ERRORS = "strict"
DEFAULT_IMAGE_ALT = ""
PAD_TABLES = False

# Convert links with same href and text to <href> format
# if they are absolute links
USE_AUTOMATIC_LINKS = True

# For checking space-only lines (used by optwrap)
RE_SPACE = re.compile(r"\s\+")

RE_ORDERED_LIST_MATCHER = re.compile(r"\d+\.\s")
RE_UNORDERED_LIST_MATCHER = re.compile(r"[-\*\+]\s")
RE_MD_CHARS_MATCHER = re.compile(r"([\\\[\]\(\)])")
RE_MD_CHARS_MATCHER_ALL = re.compile(r"([`\*_{}\[\]\(\)#!])")

# to find links in the text
RE_LINK = re.compile(r"(\[.*?\] ?\(.*?\))|(\[.*?\]:.*?)")

RE_MD_DOT_MATCHER = re.compile(
    r"""
    ^             # start of line
    (\s*\d+)      # optional whitespace and a number
    (\.)          # dot
    (?=\s)        # lookahead assert whitespace
    """,
    re.MULTILINE | re.VERBOSE,
)
RE_MD_PLUS_MATCHER = re.compile(
    r"""
    ^
    (\s*)
    (\+)
    (?=\s)
    """,
    flags=re.MULTILINE | re.VERBOSE,
)
RE_MD_DASH_MATCHER = re.compile(
    r"""
    ^
    (\s*)
    (-)
    (?=\s|\-)     # followed by whitespace (bullet list, or spaced out hr)
                  # or another dash (header or hr)
    """,
    flags=re.MULTILINE | re.VERBOSE,
)
RE_SLASH_CHARS = r"\`*_{}[]()#+-.!"
RE_MD_BACKSLASH_MATCHER = re.compile(
    r"""
    (\\)          # match one slash
    (?=[%s])      # followed by a char that requires escaping
    """
    % re.escape(RE_SLASH_CHARS),
    flags=re.VERBOSE,
)

UNIFIABLE = {
    "rsquo": "'",
    "lsquo": "'",
    "rdquo": '"',
    "ldquo": '"',
    "copy": "(C)",
    "mdash": "--",
    "nbsp": " ",
    "rarr": "->",
    "larr": "<-",
    "middot": "*",
    "ndash": "-",
    "oelig": "oe",
    "aelig": "ae",
    "agrave": "a",
    "aacute": "a",
    "acirc": "a",
    "atilde": "a",
    "auml": "a",
    "aring": "a",
    "egrave": "e",
    "eacute": "e",
    "ecirc": "e",
    "euml": "e",
    "igrave": "i",
    "iacute": "i",
    "icirc": "i",
    "iuml": "i",
    "ograve": "o",
    "oacute": "o",
    "ocirc": "o",
    "otilde": "o",
    "ouml": "o",
    "ugrave": "u",
    "uacute": "u",
    "ucirc": "u",
    "uuml": "u",
    "lrm": "",
    "rlm": "",
}

# Format tables in HTML rather than Markdown syntax
BYPASS_TABLES = False
# Ignore table-related tags (table, th, td, tr) while keeping rows
IGNORE_TABLES = False

# Use a single line break after a block element rather than two line breaks.
# NOTE: Requires body width setting to be 0.
SINGLE_LINE_BREAK = False

# Use double quotation marks when converting the <q> tag.
OPEN_QUOTE = '"' CLOSE_QUOTE = '"' html2text-2020.1.16/html2text/elements.py 0000664 0001750 0001750 00000000647 13550376531 017516 0 ustar jon jon 0000000 0000000 from typing import Dict, Optional class AnchorElement: __slots__ = ["attrs", "count", "outcount"] def __init__(self, attrs: Dict[str, Optional[str]], count: int, outcount: int): self.attrs = attrs self.count = count self.outcount = outcount class ListElement: __slots__ = ["name", "num"] def __init__(self, name: str, num: int): self.name = name self.num = num html2text-2020.1.16/html2text/py.typed 0000664 0001750 0001750 00000000000 13550376531 017006 0 ustar jon jon 0000000 0000000 html2text-2020.1.16/html2text/typing.py 0000664 0001750 0001750 00000000107 13550376531 017203 0 ustar jon jon 0000000 0000000 class OutCallback: def __call__(self, s: str) -> None: ... html2text-2020.1.16/html2text/utils.py 0000664 0001750 0001750 00000017544 13610067134 017037 0 ustar jon jon 0000000 0000000 import html.entities from typing import Dict, List, Optional from . import config unifiable_n = { html.entities.name2codepoint[k]: v for k, v in config.UNIFIABLE.items() if k != "nbsp" } def hn(tag: str) -> int: if tag[0] == "h" and len(tag) == 2: n = tag[1] if "0" < n <= "9": return int(n) return 0 def dumb_property_dict(style: str) -> Dict[str, str]: """ :returns: A hash of css attributes """ return { x.strip().lower(): y.strip().lower() for x, y in [z.split(":", 1) for z in style.split(";") if ":" in z] } def dumb_css_parser(data: str) -> Dict[str, Dict[str, str]]: """ :type data: str :returns: A hash of css selectors, each of which contains a hash of css attributes. :rtype: dict """ # remove @import sentences data += ";" importIndex = data.find("@import") while importIndex != -1: data = data[0:importIndex] + data[data.find(";", importIndex) + 1 :] importIndex = data.find("@import") # parse the css. 
reverted from dictionary comprehension in order to # support older pythons pairs = [x.split("{") for x in data.split("}") if "{" in x.strip()] try: elements = {a.strip(): dumb_property_dict(b) for a, b in pairs} except ValueError: elements = {} # not that important return elements def element_style( attrs: Dict[str, Optional[str]], style_def: Dict[str, Dict[str, str]], parent_style: Dict[str, str], ) -> Dict[str, str]: """ :type attrs: dict :type style_def: dict :type style_def: dict :returns: A hash of the 'final' style attributes of the element :rtype: dict """ style = parent_style.copy() if "class" in attrs: assert attrs["class"] is not None for css_class in attrs["class"].split(): css_style = style_def.get("." + css_class, {}) style.update(css_style) if "style" in attrs: assert attrs["style"] is not None immediate_style = dumb_property_dict(attrs["style"]) style.update(immediate_style) return style def google_list_style(style: Dict[str, str]) -> str: """ Finds out whether this is an ordered or unordered list :type style: dict :rtype: str """ if "list-style-type" in style: list_style = style["list-style-type"] if list_style in ["disc", "circle", "square", "none"]: return "ul" return "ol" def google_has_height(style: Dict[str, str]) -> bool: """ Check if the style of the element has the 'height' attribute explicitly defined :type style: dict :rtype: bool """ return "height" in style def google_text_emphasis(style: Dict[str, str]) -> List[str]: """ :type style: dict :returns: A list of all emphasis modifiers of the element :rtype: list """ emphasis = [] if "text-decoration" in style: emphasis.append(style["text-decoration"]) if "font-style" in style: emphasis.append(style["font-style"]) if "font-weight" in style: emphasis.append(style["font-weight"]) return emphasis def google_fixed_width_font(style: Dict[str, str]) -> bool: """ Check if the css of the current element defines a fixed width font :type style: dict :rtype: bool """ font_family = "" if "font-family" 
in style:
        font_family = style["font-family"]
    return "courier new" == font_family or "consolas" == font_family


def list_numbering_start(attrs: Dict[str, Optional[str]]) -> int:
    """
    Extract numbering from list element attributes

    :type attrs: dict

    :rtype: int
    """
    if "start" in attrs:
        assert attrs["start"] is not None
        try:
            return int(attrs["start"]) - 1
        except ValueError:
            pass

    return 0


def skipwrap(para: str, wrap_links: bool, wrap_list_items: bool) -> bool:
    # If it appears to contain a link
    # don't wrap
    if not wrap_links and config.RE_LINK.search(para):
        return True
    # If the text begins with four spaces or one tab, it's a code block;
    # don't wrap
    if para[0:4] == "    " or para[0] == "\t":
        return True

    # If the text begins with only two "--", possibly preceded by
    # whitespace, that's an emdash; so wrap.
    stripped = para.lstrip()
    if stripped[0:2] == "--" and len(stripped) > 2 and stripped[2] != "-":
        return False

    # I'm not sure what this is for; I thought it was to detect lists,
    # but there's a <br>-inside-<span> case in one of the tests that
    # also depends upon it.
    if stripped[0:1] in ("-", "*") and not stripped[0:2] == "**":
        return not wrap_list_items

    # If the text begins with a single -, *, or +, followed by a space,
    # or an integer, followed by a ., followed by a space (in either
    # case optionally preceded by whitespace), it's a list; don't wrap.
    return bool(
        config.RE_ORDERED_LIST_MATCHER.match(stripped)
        or config.RE_UNORDERED_LIST_MATCHER.match(stripped)
    )


def escape_md(text: str) -> str:
    """
    Escapes markdown-sensitive characters within other markdown
    constructs.
    """
    return config.RE_MD_CHARS_MATCHER.sub(r"\\\1", text)


def escape_md_section(text: str, snob: bool = False) -> str:
    """
    Escapes markdown-sensitive characters across whole document sections.
    """
    text = config.RE_MD_BACKSLASH_MATCHER.sub(r"\\\1", text)

    if snob:
        text = config.RE_MD_CHARS_MATCHER_ALL.sub(r"\\\1", text)

    text = config.RE_MD_DOT_MATCHER.sub(r"\1\\\2", text)
    text = config.RE_MD_PLUS_MATCHER.sub(r"\1\\\2", text)
    text = config.RE_MD_DASH_MATCHER.sub(r"\1\\\2", text)

    return text


def reformat_table(lines: List[str], right_margin: int) -> List[str]:
    """
    Given the lines of a table,
    pads the cells and returns the new lines
    """
    # find the maximum width of the columns
    max_width = [len(x.rstrip()) + right_margin for x in lines[0].split("|")]
    max_cols = len(max_width)
    for line in lines:
        cols = [x.rstrip() for x in line.split("|")]
        num_cols = len(cols)

        # don't drop any data if colspan attributes result in unequal lengths
        if num_cols < max_cols:
            cols += [""] * (max_cols - num_cols)
        elif max_cols < num_cols:
            max_width += [len(x) + right_margin for x in cols[-(num_cols - max_cols) :]]
            max_cols = num_cols

        max_width = [
            max(len(x) + right_margin, old_len) for x, old_len in zip(cols, max_width)
        ]

    # reformat
    new_lines = []
    for line in lines:
        cols = [x.rstrip() for x in line.split("|")]
        if set(line.strip()) == set("-|"):
            filler = "-"
            new_cols = [
                x.rstrip() + (filler * (M - len(x.rstrip())))
                for x, M in
zip(cols, max_width) ] else: filler = " " new_cols = [ x.rstrip() + (filler * (M - len(x.rstrip()))) for x, M in zip(cols, max_width) ] new_lines.append("|".join(new_cols)) return new_lines def pad_tables_in_text(text: str, right_margin: int = 1) -> str: """ Provide padding for tables in the text """ lines = text.split("\n") table_buffer = [] # type: List[str] table_started = False new_lines = [] for line in lines: # Toggle table started if config.TABLE_MARKER_FOR_PAD in line: table_started = not table_started if not table_started: table = reformat_table(table_buffer, right_margin) new_lines.extend(table) table_buffer = [] new_lines.append("") continue # Process lines if table_started: table_buffer.append(line) else: new_lines.append(line) return "\n".join(new_lines) html2text-2020.1.16/html2text.egg-info/ 0000775 0001750 0001750 00000000000 13610070526 017003 5 ustar jon jon 0000000 0000000 html2text-2020.1.16/html2text.egg-info/PKG-INFO 0000664 0001750 0001750 00000011564 13610070526 020107 0 ustar jon jon 0000000 0000000 Metadata-Version: 2.1 Name: html2text Version: 2020.1.16 Summary: Turn HTML into equivalent Markdown-structured text. Home-page: https://github.com/Alir3z4/html2text/ Author: Aaron Swartz Author-email: me@aaronsw.com Maintainer: Alireza Savand Maintainer-email: alireza.savand@gmail.com License: GNU GPL 3 Description: # html2text [](https://travis-ci.org/Alir3z4/html2text) [](https://coveralls.io/r/Alir3z4/html2text) [](https://pypi.org/project/html2text/) [](https://pypi.org/project/html2text/) [](https://pypi.org/project/html2text/) [](https://pypi.org/project/html2text/) [](https://pypi.org/project/html2text/) html2text is a Python script that converts a page of HTML into clean, easy-to-read plain ASCII text. Better yet, that ASCII also happens to be valid Markdown (a text-to-HTML format). 
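To make the idea of "HTML in, Markdown-structured text out" concrete, here is a toy sketch that uses only the standard library's `html.parser`. It is not how html2text is implemented, and the names `TinyMarkdown` and `tiny_html_to_markdown` are made up for illustration. It assumes the input contains nothing beyond bold and italic tags, and it ignores everything html2text actually handles (links, lists, wrapping, escaping).

```python
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Hypothetical toy converter: only bold and italic tags are mapped."""

    # Assumed mapping from HTML tag name to the Markdown marker it becomes.
    MARKS = {"strong": "**", "b": "**", "em": "_", "i": "_"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Unknown tags (for example <p>) contribute nothing to the output.
        self.parts.append(self.MARKS.get(tag, ""))

    def handle_endtag(self, tag):
        # The same marker both opens and closes an emphasis span.
        self.parts.append(self.MARKS.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data)


def tiny_html_to_markdown(html):
    parser = TinyMarkdown()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


print(tiny_html_to_markdown("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
```

The real library keeps far more state than this (open lists, blockquote depth, pending links, pending whitespace), which is why a plain tag-to-marker table like the one above only works for trivial input.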
Usage: `html2text [filename [encoding]]`

| Option              | Description
|---------------------|---------------------------------------------------
| `--version`         | Show program's version number and exit
| `-h`, `--help`      | Show this help message and exit
| `--ignore-links`    | Don't include any formatting for links
| `--escape-all`      | Escape all special characters. Output is less readable, but avoids corner case formatting issues.
| `--reference-links` | Use reference links instead of inline links to create markdown
| `--mark-code`       | Mark preformatted and code blocks with [code]...[/code]

For a complete list of options see the [docs](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md)

Or you can use it from within `Python`:

```
>>> import html2text
>>>
>>> print(html2text.html2text("<p><strong>Zed's</strong> dead baby, <em>Zed's</em> dead.</p>"))
**Zed's** dead baby, _Zed's_ dead.
```

Or with some configuration options:

```
>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, world!

>>> # Don't Ignore links anymore, I like links
>>> h.ignore_links = False
>>> print(h.handle("<p>Hello, <a href='https://www.google.com/earth/'>world</a>!"))
Hello, [world](https://www.google.com/earth/)!
```

*Originally written by Aaron Swartz. This code is distributed under the GPLv3.*

## How to install

`html2text` is available on PyPI:
https://pypi.org/project/html2text/

```
$ pip install html2text
```

## How to run unit tests

    tox

To see the coverage results:

    coverage html

then open the `./htmlcov/index.html` file in your browser.

## Documentation

Documentation lives [here](https://github.com/Alir3z4/html2text/blob/master/docs/usage.md)

Platform: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.5
Description-Content-Type: text/markdown
html2text-2020.1.16/html2text.egg-info/SOURCES.txt 0000664 0001750 0001750 00000007222 13610070526 020672 0 ustar jon jon 0000000 0000000 AUTHORS.rst COPYING ChangeLog.rst MANIFEST.in README.md setup.cfg setup.py tox.ini html2text/__init__.py html2text/__main__.py html2text/cli.py html2text/config.py html2text/elements.py html2text/py.typed html2text/typing.py html2text/utils.py html2text.egg-info/PKG-INFO html2text.egg-info/SOURCES.txt html2text.egg-info/dependency_links.txt html2text.egg-info/entry_points.txt html2text.egg-info/not-zip-safe html2text.egg-info/top_level.txt test/GoogleDocMassDownload.html test/GoogleDocMassDownload.md test/GoogleDocSaved.html test/GoogleDocSaved.md
test/GoogleDocSaved_two.html test/GoogleDocSaved_two.md test/__init__.py test/abbr_tag.html test/abbr_tag.md test/anchors.html test/anchors.md test/apos_element.html test/apos_element.md test/blockquote_example.html test/blockquote_example.md test/bodywidth_newline.html test/bodywidth_newline.md test/bold_inside_link.html test/bold_inside_link.md test/bold_long_line.html test/bold_long_line.md test/break_preserved_in_blockquote.html test/break_preserved_in_blockquote.md test/css_import_no_semicolon.html test/css_import_no_semicolon.md test/decript_tage.html test/decript_tage.md test/default_image_alt.html test/default_image_alt.md test/doc_with_table.html test/doc_with_table.md test/doc_with_table_bypass.html test/doc_with_table_bypass.md test/emdash-para.html test/emdash-para.md test/emphasis_preserved_whitespace.html test/emphasis_preserved_whitespace.md test/empty-link.html test/empty-link.md test/empty-title-tag.html test/empty-title-tag.md test/flip_emphasis.html test/flip_emphasis.md test/google-like_font-properties.html test/google-like_font-properties.md test/header_tags.html test/header_tags.md test/horizontal_rule.html test/horizontal_rule.md test/html-escaping.html test/html-escaping.md test/html_entities_out_of_text.html test/html_entities_out_of_text.md test/images_as_html.html test/images_as_html.md test/images_to_alt.html test/images_to_alt.md test/images_with_div_wrap.html test/images_with_div_wrap.md test/images_with_size.html test/images_with_size.md test/img-tag-with-link.html test/img-tag-with-link.md test/inplace_baseurl_substitution.html test/inplace_baseurl_substitution.md test/invalid_start.html test/invalid_start.md test/invalid_unicode.html test/invalid_unicode.md test/kbd_tag.html test/kbd_tag.md test/link_titles.html test/link_titles.md test/list_tags_example.html test/list_tags_example.md test/long_lines.html test/long_lines.md test/lrm_after_b.html test/lrm_after_b.md test/lrm_after_i.html test/lrm_after_i.md test/lrm_inside_i.html 
test/lrm_inside_i.md test/mark_code.html test/mark_code.md test/nbsp.html test/nbsp.md test/nbsp_unicode.html test/nbsp_unicode.md test/no_inline_links_example.html test/no_inline_links_example.md test/no_inline_links_images_to_alt.html test/no_inline_links_images_to_alt.md test/no_inline_links_nested.html test/no_inline_links_nested.md test/no_wrap_links.html test/no_wrap_links.md test/no_wrap_links_no_inline_links.html test/no_wrap_links_no_inline_links.md test/normal.html test/normal.md test/normal_escape_snob.html test/normal_escape_snob.md test/pad_table.html test/pad_table.md test/pre.html test/pre.md test/preformatted_in_list.html test/preformatted_in_list.md test/protect_links.html test/protect_links.md test/q_tag.html test/q_tag.md test/rlm_inside_strong.html test/rlm_inside_strong.md test/single_line_break.html test/single_line_break.md test/stressed_with_html_entities.html test/stressed_with_html_entities.md test/table_ignore.html test/table_ignore.md test/test_html2text.py test/test_memleak.py test/text_after_list.html test/text_after_list.md test/url-escaping.html test/url-escaping.md test/wrap_list_items_example.html test/wrap_list_items_example.md html2text-2020.1.16/html2text.egg-info/dependency_links.txt 0000664 0001750 0001750 00000000001 13610070526 023051 0 ustar jon jon 0000000 0000000 html2text-2020.1.16/html2text.egg-info/entry_points.txt 0000664 0001750 0001750 00000000062 13610070526 022277 0 ustar jon jon 0000000 0000000 [console_scripts] html2text = html2text.cli:main html2text-2020.1.16/html2text.egg-info/not-zip-safe 0000664 0001750 0001750 00000000001 13610070526 021231 0 ustar jon jon 0000000 0000000 html2text-2020.1.16/html2text.egg-info/top_level.txt 0000664 0001750 0001750 00000000012 13610070526 021526 0 ustar jon jon 0000000 0000000 html2text html2text-2020.1.16/setup.cfg 0000664 0001750 0001750 00000002620 13610070526 015177 0 ustar jon jon 0000000 0000000 [metadata] name = html2text version = attr: html2text.__version__ 
description = Turn HTML into equivalent Markdown-structured text. long_description = file: README.md long_description_content_type = text/markdown url = https://github.com/Alir3z4/html2text/ author = Aaron Swartz author_email = me@aaronsw.com maintainer = Alireza Savand maintainer_email = alireza.savand@gmail.com license = GNU GPL 3 classifiers = Development Status :: 5 - Production/Stable Intended Audience :: Developers License :: OSI Approved :: GNU General Public License (GPL) Operating System :: OS Independent Programming Language :: Python Programming Language :: Python :: 3 Programming Language :: Python :: 3.5 Programming Language :: Python :: 3.6 Programming Language :: Python :: 3.7 Programming Language :: Python :: 3.8 Programming Language :: Python :: 3 :: Only Programming Language :: Python :: Implementation :: CPython Programming Language :: Python :: Implementation :: PyPy platform = OS Independent [options] zip_safe = False packages = html2text python_requires = >=3.5 [options.entry_points] console_scripts = html2text = html2text.cli:main [options.package_data] html2text = py.typed [flake8] max_line_length = 88 ignore = E203 W503 [isort] combine_as_imports = True include_trailing_comma = True line_length = 88 multi_line_output = 3 [mypy] python_version = 3.5 [egg_info] tag_build = tag_date = 0 html2text-2020.1.16/setup.py 0000664 0001750 0001750 00000000046 13556624753 015110 0 ustar jon jon 0000000 0000000 from setuptools import setup setup() html2text-2020.1.16/test/ 0000775 0001750 0001750 00000000000 13610070526 014335 5 ustar jon jon 0000000 0000000 html2text-2020.1.16/test/GoogleDocMassDownload.html 0000664 0001750 0001750 00000013521 13542620133 021402 0 ustar jon jon 0000000 0000000
Sandbox test doc
first issue
- bit
- bold italic
- orange
- apple
- final
text to separate lists
- now with numbers
- the prisoner
- not an italic number
- a bold human being
- end
bold
italic
def func(x):
if x < 1:
return 'a'
return 'b'
Some fixed width text here
italic fixed width text
html2text-2020.1.16/test/GoogleDocMassDownload.md 0000664 0001750 0001750 00000000643 13542620133 021037 0 ustar jon jon 0000000 0000000 # test doc first issue - bit - _**bold italic**_ - orange - apple - final text to separate lists 1. now with numbers 2. the prisoner 1. not an _italic number_ 2. a **bold human** being 3. end **bold** _italic_ ` def func(x):` ` if x < 1:` ` return 'a'` ` return 'b'` Some ` fixed width text` here _` italic fixed width text`_ html2text-2020.1.16/test/GoogleDocSaved.html 0000664 0001750 0001750 00000007430 13542620133 020053 0 ustar jon jon 0000000 0000000
Sandbox test doc
first issue
- bit
- bold italic
- orange
- apple
- final
text to separate lists
- now with numbers
- the prisoner
- not an italic number
- a bold human being
- end
bold
italic
def func(x):
if x < 1:
return 'a'
return 'b'
Some fixed width text here
italic fixed width text
html2text-2020.1.16/test/GoogleDocSaved.md 0000664 0001750 0001750 00000000643 13542620133 017506 0 ustar jon jon 0000000 0000000 # test doc first issue - bit - _**bold italic**_ - orange - apple - final text to separate lists 1. now with numbers 2. the prisoner 1. not an _italic number_ 2. a **bold human** being 3. end **bold** _italic_ ` def func(x):` ` if x < 1:` ` return 'a'` ` return 'b'` Some ` fixed width text` here _` italic fixed width text`_ html2text-2020.1.16/test/GoogleDocSaved_two.html 0000664 0001750 0001750 00000007465 13542620133 020754 0 ustar jon jon 0000000 0000000
Sandbox test doc
first issue
- bit
- bold italic
- orange
- apple
- final
text to separate lists
- now with numbers
- the prisoner
- not an italic number
- a bold human being
- end
bold
italic
def func(x):
if x < 1:
return 'a'
return 'b'
Some fixed width text here
italic fixed width text
html2text-2020.1.16/test/GoogleDocSaved_two.md 0000664 0001750 0001750 00000000000 13434106424 020364 0 ustar jon jon 0000000 0000000
html2text-2020.1.16/test/__init__.py 0000664 0001750 0001750 00000000000 13434106424 016435 0 ustar jon jon 0000000 0000000
html2text-2020.1.16/test/abbr_tag.html 0000664 0001750 0001750 00000000077 13434106424 016771 0 ustar jon jon 0000000 0000000
TLA xyz
html2text-2020.1.16/test/abbr_tag.md 0000664 0001750 0001750 00000000051 13542620133 016413 0 ustar jon jon 0000000 0000000
TLA xyz

*[TLA]: Three Letter Acronym
html2text-2020.1.16/test/anchors.html 0000664 0001750 0001750 00000000427 13434106424 016664 0 ustar jon jon 0000000 0000000
Processing hyperlinks
Additional hyperlink tests!
Bold Linkfilename.py
The source code is calledmagic.py
html2text-2020.1.16/test/anchors.md 0000664 0001750 0001750 00000000320 13542620133 016306 0 ustar jon jon 0000000 0000000 # Processing hyperlinks Additional hyperlink tests! [**Bold Link**](http://some.link) [`filename.py`](http://some.link/filename.py) [The source code is called `magic.py`](http://some.link/magicsources.py) html2text-2020.1.16/test/apos_element.html 0000664 0001750 0001750 00000000065 13434106424 017700 0 ustar jon jon 0000000 0000000 ' html2text-2020.1.16/test/apos_element.md 0000664 0001750 0001750 00000000003 13542620133 017322 0 ustar jon jon 0000000 0000000 ' html2text-2020.1.16/test/blockquote_example.html 0000664 0001750 0001750 00000000333 13525235166 021115 0 ustar jon jon 0000000 0000000"The time has come", the Walrus said, "To talk of many things: Of shoes - and ships - and sealing wax - Of cabbages - and kings- And why the sea is boiling hot - And whether pigs have wings."html2text-2020.1.16/test/blockquote_example.md 0000664 0001750 0001750 00000000307 13542620133 020541 0 ustar jon jon 0000000 0000000 > "The time has come", the Walrus said, "To talk of many things: Of shoes - > and ships - and sealing wax - Of cabbages - and kings- And why the sea is > boiling hot - And whether pigs have wings." html2text-2020.1.16/test/bodywidth_newline.html 0000664 0001750 0001750 00000000557 13434106424 020751 0 ustar jon jon 0000000 0000000Another theory is that magician and occultist Aliester Crowley created the beast while attempting to summon evil spirits at his house on the edge of the lake in the early 1900′s. I met a local woman who prefers this explanation.
html2text-2020.1.16/test/bodywidth_newline.md 0000664 0001750 0001750 00000000434 13434106424 020377 0 ustar jon jon 0000000 0000000 Another theory is that magician and occultist [Aliester Crowley](http://en.wikipedia.org/wiki/Aleister_Crowley) created the beast while attempting to summon evil spirits at his house on the edge of the lake in the early 1900′s. **I met a local woman who prefers this explanation.** html2text-2020.1.16/test/bold_inside_link.html 0000664 0001750 0001750 00000000111 13434106424 020505 0 ustar jon jon 0000000 0000000 Text sample html2text-2020.1.16/test/bold_inside_link.md 0000664 0001750 0001750 00000000056 13542620133 020147 0 ustar jon jon 0000000 0000000 [**Text**](link.htm) [**sample**](/nothing/) html2text-2020.1.16/test/bold_long_line.html 0000664 0001750 0001750 00000000211 13434106424 020164 0 ustar jon jon 0000000 0000000text and a very long long long long long long long long long long long long long long long long long long long long line
html2text-2020.1.16/test/bold_long_line.md 0000664 0001750 0001750 00000000176 13542620133 017630 0 ustar jon jon 0000000 0000000 **text** and a very long long long long long long long long long long long long long long long long long long long long line html2text-2020.1.16/test/break_preserved_in_blockquote.html 0000664 0001750 0001750 00000000041 13434106424 023300 0 ustar jon jon 0000000 0000000 abhtml2text-2020.1.16/test/break_preserved_in_blockquote.md 0000664 0001750 0001750 00000000016 13542620133 022734 0 ustar jon jon 0000000 0000000 a > b > c html2text-2020.1.16/test/css_import_no_semicolon.html 0000664 0001750 0001750 00000000662 13542620133 022154 0 ustar jon jon 0000000 0000000
cNBSP handling test #1 CSS @import statement without semicolon handling test
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
html2text-2020.1.16/test/css_import_no_semicolon.md 0000664 0001750 0001750 00000000267 13542620133 021611 0 ustar jon jon 0000000 0000000
# CSS @import statement without semicolon handling test

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

html2text-2020.1.16/test/decript_tage.html 0000664 0001750 0001750 00000000101 13434106424 017646 0 ustar jon jon 0000000 0000000
somethingsomethingsomething
html2text-2020.1.16/test/decript_tage.md 0000664 0001750 0001750 00000000053 13542620133 017306 0 ustar jon jon 0000000 0000000
~~something~~ ~~something~~ ~~something~~
html2text-2020.1.16/test/default_image_alt.html 0000664 0001750 0001750 00000000076 13434106424 020655 0 ustar jon jon 0000000 0000000
html2text-2020.1.16/test/default_image_alt.md 0000664 0001750 0001750 00000000062 13542620133 020302 0 ustar jon jon 0000000 0000000
[](http://google.com)
html2text-2020.1.16/test/doc_with_table.html 0000664 0001750 0001750 00000001313 13434106424 020171 0 ustar jon jon 0000000 0000000
This is a test document
With some text,code
, bolds and italics.This is second header
Header 1 Header 2 Header 3 Content 1 Content 2 Image!
Content 1 Content 2 Image!