././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736445.076318 latexcodec-3.0.0/0000755005105600024240000000000014572100775013133 5ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/.coveragerc0000644005105600024240000000022014572077612015251 0ustar00dma0mtURP_dma[run] branch = True source = test,latexcodec [report] exclude_lines = pragma: no cover if TYPE_CHECKING: raise NotImplementedError ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/AUTHORS.rst0000644005105600024240000000076713774557275015042 0ustar00dma0mtURP_dmaMain authors: * David Eppstein - wrote the original LaTeX codec as a recipe on ActiveState http://code.activestate.com/recipes/252124-latex-codec/ * Peter Tröger - wrote the original latexcodec package, which contained a simple but very effective LaTeX encoder * Matthias Troffaes (matthias.troffaes@gmail.com) - wrote the lexer - integrated codec with the lexer for a simpler and more robust design - various bugfixes Contributors: * Michael Radziej * Philipp Spitzer ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736361.0 latexcodec-3.0.0/CHANGELOG.rst0000644005105600024240000000633514572100651015154 0ustar00dma0mtURP_dma3.0.0 (6 March 2024) -------------------- * Drop Python 2.7, 3.4, 3.5, and 3.6 support. Remove unneeded dependencies. * Add Python 3.11 and 3.12 support. * Added a few more translations. 2.0.1 (23 July 2020) -------------------- * Drop Python 3.3 support. * Added a few more translations. 2.0.0 (14 January 2020) ----------------------- * Lexer now processes unicode directly, to fix various issues with multibyte encodings. This also simplifies the implementation. Many thanks to davidweichiang for reporting and implementing. * New detailed description of the package for the readme, to clarify the behaviour and design choices. Many thanks to tschantzmc for contributing this description (see issue #70). * Minor fix in decoding of LaTeX comments (see issue #72). * Support Python 3.9 (see issue #75). 1.0.7 (3 May 2019) ------------------ * More symbols (THIN SPACE, various accented characters). * Fix lexer issue with multibyte encodings (thanks to davidweichiang for reporting). 1.0.6 (18 January 2018) ----------------------- * More symbols (EM SPACE, MINUS SIGN, GREEK PHI SYMBOL, HYPHEN, alternate encodings of Swedish å and Å). 1.0.5 (16 June 2017) -------------------- * More maths symbols (naturals, reals, ...). * Fix lower case z with accents (reported by AndrewSwann, see issue #51). 1.0.4 (21 September 2016) ------------------------- * Fix encoding and decoding of percent sign (reported by jgosmann, see issue #48). 1.0.3 (26 March 2016) --------------------- * New ``'keep'`` error for the ulatex encoder to keep unicode characters that cannot be translated (contributed by xuhdev, see pull requestion #45). 1.0.2 (1 March 2016) -------------------- * New ``ulatex`` codec which works as a text transform on unicode strings. * Fix spacing when translating math (see issue #29, reported by beltiste). * Performance improvements in latex to unicode translation. * Support old-style math mode (see pull request #40, contributed by xuhdev). * Treat tab character as a space character (see discussion in issue #40, raised by xuhdev). 1.0.1 (24 September 2014) ------------------------- * br"\\par" is now decoded using two newlines (see issue #26, reported by Jorrit Wronski). * Fix encoding and decoding of the ogonek (see issue #24, reported by beltiste). 1.0.0 (5 August 2014) --------------------- * Add Python 3.4 support. * Fix "DZ" decoding (see issue #21, reported and fixed by Philipp Spitzer). 0.3.2 (17 April 2014) --------------------- * Fix underscore "\\_" encoding (see issue #17, reported and fixed by Michael Radziej). 0.3.1 (5 February 2014) ----------------------- * Drop Python 3.2 support. * Drop 2to3 and instead use six to support both Python 2 and 3 from a single code base. * Fix control space "\\ " decoding. * Fix LaTeX encoding of number sign "#" and other special ascii characters (see issues #11 and #13, reported by beltiste). 0.3.0 (19 August 2013) ---------------------- * Copied lexer and codec from sphinxcontrib-bibtex. * Initial usage and API documentation. * Some small bugs fixed. 0.2 (28 September 2012) ----------------------- * Adding additional codec with brackets around special characters. 0.1 (26 May 2012) ----------------- * Initial release. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/INSTALL.rst0000644005105600024240000000622613774557275015017 0ustar00dma0mtURP_dmaInstall the module with ``pip install latexcodec``, or from source using ``python setup.py install``. Minimal Example --------------- Simply import the :mod:`latexcodec` module to enable ``"latex"`` to be used as an encoding: .. code-block:: python import latexcodec text_latex = b"\\'el\\`eve" assert text_latex.decode("latex") == u"élève" text_unicode = u"ångström" assert text_unicode.encode("latex") == b'\\aa ngstr\\"om' There are also a ``ulatex`` encoding for text transforms. The simplest way to use this codec goes through the codecs module (as for all text transform codecs on Python): .. code-block:: python import codecs import latexcodec text_latex = u"\\'el\\`eve" assert codecs.decode(text_latex, "ulatex") == u"élève" text_unicode = u"ångström" assert codecs.encode(text_unicode, "ulatex") == u'\\aa ngstr\\"om' By default, the LaTeX input is assumed to be ascii, as per standard LaTeX. However, you can also specify an extra codec as ``latex+`` or ``ulatex+``, where ```` describes another encoding. In this case characters will be translated to and from that encoding whenever possible. The following code snippet demonstrates this behaviour: .. code-block:: python import latexcodec text_latex = b"\xfe" assert text_latex.decode("latex+latin1") == u"þ" assert text_latex.decode("latex+latin2") == u"ţ" text_unicode = u"ţ" assert text_unicode.encode("latex+latin1") == b'\\c t' # ţ is not latin1 assert text_unicode.encode("latex+latin2") == b'\xfe' # but it is latin2 When encoding using the ``ulatex`` codec, you have the option to pass through characters that cannot be encoded in the desired encoding, by using the ``'keep'`` error. This can be a useful fallback option if you want to encode as much as possible, whilst still retaining as much as possible of the original code when encoding fails. If instead you want to translate to LaTeX but keep as much of the unicode as possible, use the ``ulatex+utf8`` codec, which should never fail. .. code-block:: python import codecs import latexcodec text_unicode = u'⌨' # \u2328 = keyboard symbol, currently not translated try: # raises a value error as \u2328 cannot be encoded into latex codecs.encode(text_unicode, "ulatex+ascii") except ValueError: pass assert codecs.encode(text_unicode, "ulatex+ascii", "keep") == u'⌨' assert codecs.encode(text_unicode, "ulatex+utf8") == u'⌨' Limitations ----------- * Not all unicode characters are registered. If you find any missing, please report them on the tracker: https://github.com/mcmtroffaes/latexcodec/issues * Unicode combining characters are currently not handled. * By design, the codec never removes curly brackets. This is because it is very hard to guess whether brackets are part of a command or not (this would require a full latex parser). Moreover, bibtex uses curly brackets as a guard against case conversion, in which case automatic removal of curly brackets may not be desired at all, even if they are not part of a command. Also see: http://stackoverflow.com/a/19754245/2863746 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/LICENSE.rst0000644005105600024240000000217113774557275014766 0ustar00dma0mtURP_dma| latexcodec is a lexer and codec to work with LaTeX code in Python | Copyright (c) 2011-2020 by Matthias C. M. Troffaes Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736230.0 latexcodec-3.0.0/MANIFEST.in0000644005105600024240000000053014572100446014662 0ustar00dma0mtURP_dmainclude VERSION include README.rst include INSTALL.rst include CHANGELOG.rst include LICENSE.rst include AUTHORS.rst recursive-include doc * recursive-include test * global-exclude *.pyc global-exclude .gitignore prune doc/_build exclude .travis.yml include latexcodec/table.txt include latexcodec/py.typed include mypy.ini include .coveragerc ././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736445.074319 latexcodec-3.0.0/PKG-INFO0000644005105600024240000001136314572100775014234 0ustar00dma0mtURP_dmaMetadata-Version: 2.1 Name: latexcodec Version: 3.0.0 Summary: A lexer and codec to work with LaTeX code in Python. Home-page: https://github.com/mcmtroffaes/latexcodec Download-URL: http://pypi.python.org/pypi/latexcodec Author: Matthias C. M. Troffaes Author-email: matthias.troffaes@gmail.com License: MIT Platform: any Classifier: Development Status :: 5 - Production/Stable Classifier: Environment :: Console Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: MIT License Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.7 Classifier: Programming Language :: Python :: 3.8 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: Python :: 3.11 Classifier: Programming Language :: Python :: 3.12 Classifier: Topic :: Text Processing :: Markup :: LaTeX Classifier: Topic :: Text Processing :: Filters Requires-Python: >=3.7 License-File: LICENSE.rst License-File: AUTHORS.rst * **Instead of using latexcodec, I encourage you to consider pylatexenc instead, which is far superior:** https://github.com/phfaist/pylatexenc * Download: http://pypi.python.org/pypi/latexcodec/#downloads * Documentation: http://latexcodec.readthedocs.org/ * Development: http://github.com/mcmtroffaes/latexcodec/ .. |ci| image:: https://github.com/mcmtroffaes/latexcodec/actions/workflows/python-package.yml/badge.svg :target: https://github.com/mcmtroffaes/latexcodec/actions/workflows/python-package.yml :alt: ci .. |codecov| image:: https://codecov.io/gh/mcmtroffaes/latexcodec/branch/develop/graph/badge.svg :target: https://codecov.io/gh/mcmtroffaes/latexcodec :alt: codecov The codec provides a convenient way of going between text written in LaTeX and unicode. Since it is not a LaTeX compiler, it is more appropriate for short chunks of text, such as a paragraph or the values of a BibTeX entry, and it is not appropriate for a full LaTeX document. In particular, its behavior on the LaTeX commands that do not simply select characters is intended to allow the unicode representation to be understandable by a human reader, but is not canonical and may require hand tuning to produce the desired effect. The encoder does a best effort to replace unicode characters outside of the range used as LaTeX input (ascii by default) with a LaTeX command that selects the character. More technically, the unicode code point is replaced by a LaTeX command that selects a glyph that reasonably represents the code point. Unicode characters with special uses in LaTeX are replaced by their LaTeX equivalents. For example, ====================== =================== original text encoded LaTeX ====================== =================== ``¥`` ``\yen`` ``ü`` ``\"u`` ``\N{NO-BREAK SPACE}`` ``~`` ``~`` ``\textasciitilde`` ``%`` ``\%`` ``#`` ``\#`` ``\textbf{x}`` ``\textbf{x}`` ====================== =================== The decoder does a best effort to replace LaTeX commands that select characters with the unicode for the character they are selecting. For example, ===================== ====================== original LaTeX decoded unicode ===================== ====================== ``\yen`` ``¥`` ``\"u`` ``ü`` ``~`` ``\N{NO-BREAK SPACE}`` ``\textasciitilde`` ``~`` ``\%`` ``%`` ``\#`` ``#`` ``\textbf{x}`` ``\textbf {x}`` ``#`` ``#`` ===================== ====================== In addition, comments are dropped (including the final newline that marks the end of a comment), paragraphs are canonicalized into double newlines, and other newlines are left as is. Spacing after LaTeX commands is also canonicalized. For example, :: hi % bye there\par world \textbf {awesome} is decoded as :: hi there world \textbf {awesome} When decoding, LaTeX commands not directly selecting characters (for example, macros and formatting commands) are passed through unchanged. The same happens for LaTeX commands that select characters but are not yet recognized by the codec. Either case can result in a hybrid unicode string in which some characters are understood as literally the character and others as parts of unexpanded commands. Consequently, at times, backslashes will be left intact for denoting the start of a potentially unrecognized control sequence. Given the numerous and changing packages providing such LaTeX commands, the codec will never be complete, and new translations of unrecognized unicode or unrecognized LaTeX symbols are always welcome. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/README.rst0000644005105600024240000000734614572077612014637 0ustar00dma0mtURP_dmalatexcodec ========== |ci| |codecov| A lexer and codec to work with LaTeX code in Python. * **Instead of using latexcodec, I encourage you to consider pylatexenc instead, which is far superior:** https://github.com/phfaist/pylatexenc * Download: http://pypi.python.org/pypi/latexcodec/#downloads * Documentation: http://latexcodec.readthedocs.org/ * Development: http://github.com/mcmtroffaes/latexcodec/ .. |ci| image:: https://github.com/mcmtroffaes/latexcodec/actions/workflows/python-package.yml/badge.svg :target: https://github.com/mcmtroffaes/latexcodec/actions/workflows/python-package.yml :alt: ci .. |codecov| image:: https://codecov.io/gh/mcmtroffaes/latexcodec/branch/develop/graph/badge.svg :target: https://codecov.io/gh/mcmtroffaes/latexcodec :alt: codecov The codec provides a convenient way of going between text written in LaTeX and unicode. Since it is not a LaTeX compiler, it is more appropriate for short chunks of text, such as a paragraph or the values of a BibTeX entry, and it is not appropriate for a full LaTeX document. In particular, its behavior on the LaTeX commands that do not simply select characters is intended to allow the unicode representation to be understandable by a human reader, but is not canonical and may require hand tuning to produce the desired effect. The encoder does a best effort to replace unicode characters outside of the range used as LaTeX input (ascii by default) with a LaTeX command that selects the character. More technically, the unicode code point is replaced by a LaTeX command that selects a glyph that reasonably represents the code point. Unicode characters with special uses in LaTeX are replaced by their LaTeX equivalents. For example, ====================== =================== original text encoded LaTeX ====================== =================== ``¥`` ``\yen`` ``ü`` ``\"u`` ``\N{NO-BREAK SPACE}`` ``~`` ``~`` ``\textasciitilde`` ``%`` ``\%`` ``#`` ``\#`` ``\textbf{x}`` ``\textbf{x}`` ====================== =================== The decoder does a best effort to replace LaTeX commands that select characters with the unicode for the character they are selecting. For example, ===================== ====================== original LaTeX decoded unicode ===================== ====================== ``\yen`` ``¥`` ``\"u`` ``ü`` ``~`` ``\N{NO-BREAK SPACE}`` ``\textasciitilde`` ``~`` ``\%`` ``%`` ``\#`` ``#`` ``\textbf{x}`` ``\textbf {x}`` ``#`` ``#`` ===================== ====================== In addition, comments are dropped (including the final newline that marks the end of a comment), paragraphs are canonicalized into double newlines, and other newlines are left as is. Spacing after LaTeX commands is also canonicalized. For example, :: hi % bye there\par world \textbf {awesome} is decoded as :: hi there world \textbf {awesome} When decoding, LaTeX commands not directly selecting characters (for example, macros and formatting commands) are passed through unchanged. The same happens for LaTeX commands that select characters but are not yet recognized by the codec. Either case can result in a hybrid unicode string in which some characters are understood as literally the character and others as parts of unexpanded commands. Consequently, at times, backslashes will be left intact for denoting the start of a potentially unrecognized control sequence. Given the numerous and changing packages providing such LaTeX commands, the codec will never be complete, and new translations of unrecognized unicode or unrecognized LaTeX symbols are always welcome. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736375.0 latexcodec-3.0.0/VERSION0000644005105600024240000000000614572100667014177 0ustar00dma0mtURP_dma3.0.0 ././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736444.842325 latexcodec-3.0.0/doc/0000755005105600024240000000000014572100775013700 5ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/Makefile0000644005105600024240000001271413774557275015363 0ustar00dma0mtURP_dma# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: -rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/latexcodec.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/latexcodec.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/latexcodec" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/latexcodec" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." ././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736444.870321 latexcodec-3.0.0/doc/api/0000755005105600024240000000000014572100775014451 5ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/api/codec.rst0000644005105600024240000000004113774557275016271 0ustar00dma0mtURP_dma.. automodule:: latexcodec.codec ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/api/lexer.rst0000644005105600024240000000004113774557275016333 0ustar00dma0mtURP_dma.. automodule:: latexcodec.lexer ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/api.rst0000644005105600024240000000010413774557275015214 0ustar00dma0mtURP_dmaAPI ~~~ .. toctree:: :maxdepth: 2 api/codec api/lexer ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/authors.rst0000644005105600024240000000005613774557275016136 0ustar00dma0mtURP_dmaAuthors ======= .. include:: ../AUTHORS.rst ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/changes.rst0000644005105600024240000000007613774557275016063 0ustar00dma0mtURP_dma:tocdepth: 1 Changes ======= .. include:: ../CHANGELOG.rst ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/doc/conf.py0000644005105600024240000000225414572077612015205 0ustar00dma0mtURP_dma# -*- coding: utf-8 -*- # # latexcodec documentation build configuration file, created by # sphinx-quickstart on Wed Aug 3 15:45:22 2011. extensions = [ 'sphinx.ext.autodoc', 'sphinx.ext.doctest', 'sphinx.ext.intersphinx', 'sphinx.ext.todo', 'sphinx.ext.coverage', 'sphinx.ext.imgmath', 'sphinx.ext.viewcode'] source_suffix = '.rst' master_doc = 'index' project = 'latexcodec' copyright = '2011-2014, Matthias C. M. Troffaes' with open("../VERSION") as version_file: release = version_file.read().strip() version = '.'.join(release.split('.')[:2]) exclude_patterns = ['_build'] pygments_style = 'sphinx' html_theme = 'default' htmlhelp_basename = 'latexcodecdoc' latex_documents = [ ('index', 'latexcodec.tex', 'latexcodec Documentation', 'Matthias C. M. Troffaes', 'manual'), ] man_pages = [ ('index', 'latexcodec', 'latexcodec Documentation', ['Matthias C. M. Troffaes'], 1) ] texinfo_documents = [ ('index', 'latexcodec', 'latexcodec Documentation', 'Matthias C. M. Troffaes', 'latexcodec', 'One line description of project.', 'Miscellaneous'), ] intersphinx_mapping = { 'python': ('http://docs.python.org/', None), } ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/index.rst0000644005105600024240000000047213774557275015562 0ustar00dma0mtURP_dmaWelcome to latexcodec's documentation! ====================================== :Release: |release| :Date: |today| Contents -------- .. toctree:: :maxdepth: 2 quickstart api changes authors license Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/license.rst0000644005105600024240000000044113774557275016071 0ustar00dma0mtURP_dmaLicense ======= .. include:: ../LICENSE.rst .. rubric:: Remark Versions 0.1 and 0.2 of the latexcodec package were written by Peter Tröger, and were released under the Academic Free License 3.0. The current version of the latexcodec package shares no code with those earlier versions. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/doc/make.bat0000644005105600024240000001176013774557275015330 0ustar00dma0mtURP_dma@ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=_build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . set I18NSPHINXOPTS=%SPHINXOPTS% . if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^` where ^ is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo. linkcheck to check all external links for integrity echo. doctest to run all doctests embedded in the documentation if enabled goto end ) if "%1" == "clean" ( for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i del /q /s %BUILDDIR%\* goto end ) if "%1" == "html" ( %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/html. goto end ) if "%1" == "dirhtml" ( %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. goto end ) if "%1" == "singlehtml" ( %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. goto end ) if "%1" == "pickle" ( %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the pickle files. goto end ) if "%1" == "json" ( %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the JSON files. goto end ) if "%1" == "htmlhelp" ( %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run HTML Help Workshop with the ^ .hhp project file in %BUILDDIR%/htmlhelp. goto end ) if "%1" == "qthelp" ( %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run "qcollectiongenerator" with the ^ .qhcp project file in %BUILDDIR%/qthelp, like this: echo.^> qcollectiongenerator %BUILDDIR%\qthelp\latexcodec.qhcp echo.To view the help file: echo.^> assistant -collectionFile %BUILDDIR%\qthelp\latexcodec.ghc goto end ) if "%1" == "devhelp" ( %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp if errorlevel 1 exit /b 1 echo. echo.Build finished. goto end ) if "%1" == "epub" ( %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub if errorlevel 1 exit /b 1 echo. echo.Build finished. The epub file is in %BUILDDIR%/epub. goto end ) if "%1" == "latex" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex if errorlevel 1 exit /b 1 echo. echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. goto end ) if "%1" == "text" ( %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text if errorlevel 1 exit /b 1 echo. echo.Build finished. The text files are in %BUILDDIR%/text. goto end ) if "%1" == "man" ( %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man if errorlevel 1 exit /b 1 echo. echo.Build finished. The manual pages are in %BUILDDIR%/man. goto end ) if "%1" == "texinfo" ( %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo if errorlevel 1 exit /b 1 echo. echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. goto end ) if "%1" == "gettext" ( %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale if errorlevel 1 exit /b 1 echo. echo.Build finished. The message catalogs are in %BUILDDIR%/locale. goto end ) if "%1" == "changes" ( %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes if errorlevel 1 exit /b 1 echo. echo.The overview file is in %BUILDDIR%/changes. goto end ) if "%1" == "linkcheck" ( %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck if errorlevel 1 exit /b 1 echo. echo.Link check complete; look for any errors in the above output ^ or in %BUILDDIR%/linkcheck/output.txt. goto end ) if "%1" == "doctest" ( %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest if errorlevel 1 exit /b 1 echo. echo.Testing of doctests in the sources finished, look at the ^ results in %BUILDDIR%/doctest/output.txt. goto end ) :end ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/doc/quickstart.rst0000644005105600024240000000073214572077612016631 0ustar00dma0mtURP_dmaGetting Started =============== Overview -------- .. include:: ../README.rst :start-line: 5 Installation ------------ .. include:: ../INSTALL.rst Related Projects ---------------- * `pylatexenc `_: A LaTeX parser providing fully customizable latex-to-unicode and unicode-to-latex translation. * `TexSoup `_: A LaTeX parser for searching, navigating, and modifying LaTeX documents. ././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736444.939319 latexcodec-3.0.0/latexcodec/0000755005105600024240000000000014572100775015246 5ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1609752253.0 latexcodec-3.0.0/latexcodec/__init__.py0000644005105600024240000000006413774557275017375 0ustar00dma0mtURP_dmaimport latexcodec.codec latexcodec.codec.register() ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/latexcodec/codec.py0000644005105600024240000004102314572077612016700 0ustar00dma0mtURP_dma""" LaTeX Codec ~~~~~~~~~~~ The :mod:`latexcodec.codec` module contains all classes and functions for LaTeX code translation. For practical use, you should only ever need to import the :mod:`latexcodec` module, which will automatically register the codec so it can be used by :meth:`str.encode`, :meth:`str.decode`, and any of the functions defined in the :mod:`codecs` module such as :func:`codecs.open` and so on. The other functions and classes are exposed in case someone would want to extend them. .. autofunction:: register .. autofunction:: find_latex .. autoclass:: LatexIncrementalEncoder :show-inheritance: :members: .. autoclass:: LatexIncrementalDecoder :show-inheritance: :members: .. autoclass:: LatexCodec :show-inheritance: :members: .. autoclass:: LatexUnicodeTable :members: """ # Copyright (c) 2003, 2008 David Eppstein # Copyright (c) 2011-2020 Matthias C. M. Troffaes # # Permission is hereby granted, free of charge, to any person # obtaining a copy of this software and associated documentation # files (the "Software"), to deal in the Software without # restriction, including without limitation the rights to use, # copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the # Software is furnished to do so, subject to the following # conditions: # # The above copyright notice and this permission notice shall be # included in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES # OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, # WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING # FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR # OTHER DEALINGS IN THE SOFTWARE. import codecs import dataclasses import unicodedata from typing import Optional, List, Union, Any, Iterator, Tuple, Type, Dict try: import importlib.resources as pkg_resources except ImportError: import importlib_resources as pkg_resources from latexcodec import lexer from codecs import CodecInfo def register(): """Register the :func:`find_latex` codec search function. .. seealso:: :func:`codecs.register` """ codecs.register(find_latex) # returns the codec search function # this is used if latex_codec.py were to be placed in stdlib def getregentry() -> Optional[CodecInfo]: """Encodings module API.""" return find_latex('latex') @dataclasses.dataclass class UnicodeLatexTranslation: unicode: str latex: str encode: bool #: Suitable for unicode -> latex. decode: bool #: Suitable for latex -> unicode. text_mode: bool #: Latex works in text mode. math_mode: bool #: Latex works in math mode. def load_unicode_latex_table() -> Iterator[UnicodeLatexTranslation]: with pkg_resources.open_text('latexcodec', 'table.txt') as datafile: for line in datafile: marker, unicode_names, latex = line.rstrip('\r\n').split('\u0009') unicode = ''.join( unicodedata.lookup(name) for name in unicode_names.split(',')) yield UnicodeLatexTranslation( unicode=unicode, latex=latex, encode=marker[1] in {'-', '>'}, decode=marker[1] in {'-', '<'}, text_mode=marker[0] in {'A', 'T'}, math_mode=marker[0] in {'A', 'M'}, ) class LatexUnicodeTable: """Tabulates a translation between LaTeX and unicode.""" def __init__(self, lexer_): self.lexer: lexer.LatexIncrementalLexer = lexer_ self.unicode_map: Dict[Tuple[lexer.Token, ...], str] = {} self.max_length: int = 0 self.latex_map: Dict[ str, Tuple[str, Tuple[lexer.Token, ...]]] = {} self.register_all() def register_all(self): """Register all symbols and their LaTeX equivalents (called by constructor). """ # register special symbols self.register(UnicodeLatexTranslation( unicode='\n\n', latex=' \\par', encode=False, decode=True, text_mode=True, math_mode=False, )) self.register(UnicodeLatexTranslation( unicode='\n\n', latex='\\par', encode=False, decode=True, text_mode=True, math_mode=False, )) for trans in load_unicode_latex_table(): self.register(trans) def register(self, trans: UnicodeLatexTranslation): """Register a correspondence between *unicode_text* and *latex_text*. :param UnicodeLatexTranslation trans: Description of translation. """ if trans.math_mode and not trans.text_mode: # also register text version self.register(UnicodeLatexTranslation( unicode=trans.unicode, latex='$' + trans.latex + '$', text_mode=True, math_mode=False, decode=trans.decode, encode=trans.encode, )) self.register(UnicodeLatexTranslation( unicode=trans.unicode, latex=r'\(' + trans.latex + r'\)', text_mode=True, math_mode=False, decode=trans.decode, encode=trans.encode, )) # for the time being, we do not perform in-math substitutions return # tokenize, and register unicode translation self.lexer.reset() self.lexer.state = 'M' tokens = tuple(self.lexer.get_tokens(trans.latex, final=True)) if trans.decode: if tokens not in self.unicode_map: self.max_length = max(self.max_length, len(tokens)) self.unicode_map[tokens] = trans.unicode # also register token variant with brackets, if appropriate # for instance, "\'{e}" for "\'e", "\c{c}" for "\c c", etc. # note: we do not remove brackets (they sometimes matter, # e.g. bibtex uses them to prevent lower case transformation) if (len(tokens) == 2 and tokens[0].name.startswith('control') and tokens[1].name == 'chars'): self.register(UnicodeLatexTranslation( unicode=f"{{{trans.unicode}}}", latex=f"{tokens[0].text}{{{tokens[1].text}}}", decode=True, encode=False, math_mode=trans.math_mode, text_mode=trans.text_mode, )) if (len(tokens) == 4 and tokens[0].text in {'$', r'\('} and tokens[1].name.startswith('control') and tokens[2].name == 'chars' and tokens[3].text in {'$', r'\)'}): # drop brackets in this case, since it is math mode self.register(UnicodeLatexTranslation( unicode=f"{trans.unicode}", latex=f"{tokens[0].text}{tokens[1].text}" f"{{{tokens[2].text}}}{tokens[3].text}", decode=True, encode=False, math_mode=trans.math_mode, text_mode=trans.text_mode, )) if trans.encode and trans.unicode not in self.latex_map: assert len(trans.unicode) == 1 self.latex_map[trans.unicode] = (trans.latex, tokens) _LATEX_UNICODE_TABLE = LatexUnicodeTable(lexer.LatexIncrementalDecoder()) # incremental encoder does not need a buffer # but decoder does class LatexIncrementalEncoder(lexer.LatexIncrementalEncoder): """Translating incremental encoder for latex. Maintains a state to determine whether control spaces etc. need to be inserted. """ emptytoken = lexer.Token("unknown", "") #: The empty token. table = _LATEX_UNICODE_TABLE #: Translation table. state: str def __init__(self, errors='strict'): super().__init__(errors=errors) self.reset() def reset(self): super(LatexIncrementalEncoder, self).reset() self.state = 'M' def get_space_bytes(self, bytes_: str) -> Tuple[str, str]: """Inserts space bytes in space eating mode.""" if self.state == 'S': # in space eating mode # control space needed? if bytes_.startswith(' '): # replace by control space return '\\ ', bytes_[1:] else: # insert space (it is eaten, but needed for separation) return ' ', bytes_ else: return '', bytes_ def _get_latex_chars_tokens_from_char( self, c: str) -> Tuple[str, Tuple[lexer.Token, ...]]: # if ascii, try latex equivalents # (this covers \, #, &, and other special LaTeX characters) if ord(c) < 128: try: return self.table.latex_map[c] except KeyError: pass # next, try input encoding try: c.encode(self.inputenc, 'strict') except UnicodeEncodeError: pass else: return c, (lexer.Token(name='chars', text=c),) # next, try latex equivalents of common unicode characters try: return self.table.latex_map[c] except KeyError: # translation failed if self.errors == 'strict': raise UnicodeEncodeError( "latex", # codec c, # problematic input 0, 1, # location of problematic character "don't know how to translate {0} into latex" .format(repr(c))) elif self.errors == 'ignore': return '', (self.emptytoken,) elif self.errors == 'replace': # use the \\char command # this assumes # \usepackage[T1]{fontenc} # \usepackage[utf8]{inputenc} bytes_ = '{\\char' + str(ord(c)) + '}' return bytes_, (lexer.Token(name='chars', text=bytes_),) elif self.errors == 'keep': return c, (lexer.Token(name='chars', text=c),) else: raise ValueError( "latex codec does not support {0} errors" .format(self.errors)) def get_latex_chars( self, unicode_: str, final: bool = False) -> Iterator[str]: if not isinstance(unicode_, str): raise TypeError( "expected unicode for encode input, but got {0} instead" .format(unicode_.__class__.__name__)) # convert character by character for pos, c in enumerate(unicode_): bytes_, tokens = self._get_latex_chars_tokens_from_char(c) space, bytes_ = self.get_space_bytes(bytes_) # update state if tokens and tokens[-1].name == 'control_word': # we're eating spaces self.state = 'S' elif tokens: self.state = 'M' if space: yield space yield bytes_ class LatexIncrementalDecoder(lexer.LatexIncrementalDecoder): """Translating incremental decoder for LaTeX.""" table = _LATEX_UNICODE_TABLE #: Translation table. token_buffer: List[lexer.Token] #: The token buffer of this decoder. def __init__(self, errors='strict'): lexer.LatexIncrementalDecoder.__init__(self, errors=errors) def reset(self): lexer.LatexIncrementalDecoder.reset(self) self.token_buffer = [] # python codecs API does not support multibuffer incremental decoders def getstate(self) -> Any: raise NotImplementedError def setstate(self, state: Any) -> None: raise NotImplementedError def get_unicode_tokens(self, chars: str, final: bool = False ) -> Iterator[str]: for token in self.get_tokens(chars, final=final): # at this point, token_buffer does not match anything self.token_buffer.append(token) # new token appended at the end, see if we have a match now # note: match is only possible at the *end* of the buffer # because all other positions have already been checked in # earlier iterations for i in range(len(self.token_buffer), 0, -1): last_tokens = tuple(self.token_buffer[-i:]) # last i tokens try: unicode_text = self.table.unicode_map[last_tokens] except KeyError: # no match: continue continue else: # match!! flush buffer, and translate last bit # exclude last i tokens for token2 in self.token_buffer[:-i]: yield self.decode_token(token2) yield unicode_text self.token_buffer = [] break # flush tokens that can no longer match while len(self.token_buffer) >= self.table.max_length: yield self.decode_token(self.token_buffer.pop(0)) # also flush the buffer at the end if final: for token in self.token_buffer: yield self.decode_token(token) self.token_buffer = [] class LatexCodec(codecs.Codec): IncrementalEncoder: Type[LatexIncrementalEncoder] IncrementalDecoder: Type[LatexIncrementalDecoder] def encode(self, unicode_: str, errors='strict' # type: ignore ) -> Tuple[Union[bytes, str], int]: """Convert unicode string to LaTeX bytes.""" encoder = self.IncrementalEncoder(errors=errors) return encoder.encode(unicode_, final=True), len(unicode_) def decode(self, bytes_: Union[bytes, str], errors='strict' ) -> Tuple[str, int]: """Convert LaTeX bytes to unicode string.""" decoder = self.IncrementalDecoder(errors=errors) return decoder.decode(bytes_, final=True), len(bytes_) # type: ignore class UnicodeLatexIncrementalDecoder(LatexIncrementalDecoder): def decode(self, bytes_: str, final: bool = False) -> str: # type: ignore return self.udecode(bytes_, final) class UnicodeLatexIncrementalEncoder(LatexIncrementalEncoder): def encode(self, unicode_: str, final: bool = False # type: ignore ) -> str: return self.uencode(unicode_, final) def find_latex(encoding: str) -> Optional[CodecInfo]: """Return a :class:`codecs.CodecInfo` instance for the requested LaTeX *encoding*, which must be equal to ``latex``, or to ``latex+`` where ```` describes another encoding. """ IncEnc: Type[LatexIncrementalEncoder] IncDec: Type[LatexIncrementalDecoder] if '_' in encoding: # Python 3.9 now normalizes "latex+latin1" to "latex_latin1" # https://bugs.python.org/issue37751 encoding, _, inputenc_ = encoding.partition("_") else: encoding, _, inputenc_ = encoding.partition("+") if not inputenc_: inputenc_ = "ascii" if encoding == "latex": incremental_encoder = type( "incremental_encoder", (LatexIncrementalEncoder,), dict(inputenc=inputenc_)) incremental_decoder = type( "incremental_encoder", (LatexIncrementalDecoder,), dict(inputenc=inputenc_)) elif encoding == "ulatex": incremental_encoder = type( "incremental_encoder", (UnicodeLatexIncrementalEncoder,), dict(inputenc=inputenc_)) incremental_decoder = type( "incremental_encoder", (UnicodeLatexIncrementalDecoder,), dict(inputenc=inputenc_)) else: return None class Codec(LatexCodec): IncrementalEncoder = incremental_encoder IncrementalDecoder = incremental_decoder class StreamWriter(Codec, codecs.StreamWriter): pass class StreamReader(Codec, codecs.StreamReader): pass return codecs.CodecInfo( encode=Codec().encode, # type: ignore decode=Codec().decode, # type: ignore incrementalencoder=Codec.IncrementalEncoder, incrementaldecoder=Codec.IncrementalDecoder, streamreader=StreamReader, streamwriter=StreamWriter, ) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/latexcodec/lexer.py0000644005105600024240000004231514572077612016747 0ustar00dma0mtURP_dma""" LaTeX Lexer ~~~~~~~~~~~ This module contains all classes for lexing LaTeX code, as well as general purpose base classes for incremental LaTeX decoders and encoders, which could be useful in case you are writing your own custom LaTeX codec. .. autoclass:: Token(name, text) .. autoclass:: LatexLexer :show-inheritance: :members: .. autoclass:: LatexIncrementalLexer :show-inheritance: :members: .. autoclass:: LatexIncrementalDecoder :show-inheritance: :members: .. autoclass:: LatexIncrementalEncoder :show-inheritance: :members: """ # Copyright (c) 2003, 2008 David Eppstein # Copyright (c) 2011-2020 Matthias C. M. Troffaes # # Permission is hereby granted, free of charge, to any person # obtaining a copy of this software and associated documentation # files (the "Software"), to deal in the Software without # restriction, including without limitation the rights to use, # copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the # Software is furnished to do so, subject to the following # conditions: # # The above copyright notice and this permission notice shall be # included in all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, # EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES # OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND # NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT # HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, # WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING # FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR # OTHER DEALINGS IN THE SOFTWARE. import codecs import re import unicodedata from abc import ABC, ABCMeta from typing import Iterator, Tuple, Sequence, Any, NamedTuple class Token(NamedTuple): name: str text: str # implementation note: we derive from IncrementalDecoder because this # class serves excellently as a base class for incremental decoders, # but of course we don't decode yet until later class MetaRegexpLexer(ABCMeta): """Metaclass for :class:`RegexpLexer`. Compiles tokens into a regular expression. """ def __init__(cls, name, bases, dct): super().__init__(name, bases, dct) regexp_string = "|".join( "(?P<" + name + ">" + regexp + ")" for name, regexp in getattr(cls, "tokens", [])) cls.regexp = re.compile(regexp_string, re.DOTALL) class RegexpLexer(codecs.IncrementalDecoder, metaclass=MetaRegexpLexer): """Abstract base class for regexp based lexers.""" emptytoken = Token("unknown", "") #: The empty token. tokens: Sequence[Tuple[str, str]] = () #: Sequence of token regexps. errors: str #: How to respond to errors. raw_buffer: Token #: The raw buffer of this lexer. regexp: Any #: Compiled regular expression. def __init__(self, errors: str = 'strict') -> None: """Initialize the codec.""" super().__init__(errors=errors) self.errors = errors self.reset() def reset(self) -> None: """Reset state.""" # buffer for storing last (possibly incomplete) token self.raw_buffer = self.emptytoken def getstate(self) -> Any: """Get state.""" return self.raw_buffer.text, 0 def setstate(self, state: Any) -> None: """Set state. The *state* must correspond to the return value of a previous :meth:`getstate` call. """ self.raw_buffer = Token('unknown', state[0]) def get_raw_tokens(self, chars: str, final: bool = False ) -> Iterator[Token]: """Yield tokens without any further processing. Tokens are one of: - ``\\``: a control word (i.e. a command) - ``\\``: a control symbol (i.e. \\^ etc.) - ``#``: a parameter - a series of byte characters """ if self.raw_buffer.text: chars = self.raw_buffer.text + chars self.raw_buffer = self.emptytoken for match in self.regexp.finditer(chars): # yield the buffer token if self.raw_buffer.text: yield self.raw_buffer # fill buffer with next token assert match.lastgroup is not None self.raw_buffer = Token(match.lastgroup, match.group(0)) if final: for token in self.flush_raw_tokens(): yield token def flush_raw_tokens(self) -> Iterator[Token]: """Flush the raw token buffer.""" if self.raw_buffer.text: yield self.raw_buffer self.raw_buffer = self.emptytoken class LatexLexer(RegexpLexer, ABC): """A very simple lexer for tex/latex.""" # implementation note: every token **must** be decodable by inputenc tokens = [ # match newlines and percent first, to ensure comments match correctly ('control_symbol_x2', r'[\\][\\]|[\\]%'), # comment: for ease, and for speed, we handle it as a token ('comment', r'%[^\n]*'), # control tokens # in latex, some control tokens skip following whitespace # ('control-word' and 'control-symbol') # others do not ('control-symbol-x') # XXX TBT says no control symbols skip whitespace (except '\ ') # XXX but tests reveal otherwise? ('control_word', r'[\\][a-zA-Z]+'), ('control_symbol', r'[\\][~' r"'" r'"` =^!.]'), # TODO should only match ascii ('control_symbol_x', r'[\\][^a-zA-Z]'), # parameter tokens # also support a lone hash so we can lex things like '#a' ('parameter', r'\#[0-9]|\#'), # any remaining characters; for ease we also handle space and # newline as tokens # XXX TBT does not mention \t to be a space character as well # XXX but tests reveal otherwise? ('space', r' |\t'), ('newline', r'\n'), ('mathshift', r'[$][$]|[$]'), # note: some chars joined together to make it easier to detect # symbols that have a special function (i.e. --, ---, etc.) ('chars', r'---|--|-|[`][`]' r"|['][']" r'|[?][`]|[!][`]' # separate chars because brackets are optional # e.g. fran\\c cais = fran\\c{c}ais in latex # so only way to detect \\c acting on c only is this way r'|(?![ %#$\n\t\\]).'), # trailing garbage which we cannot decode otherwise # (such as a lone '\' at the end of a buffer) # is never emitted, but used internally by the buffer ('unknown', r'.'), ] """List of token names, and the regular expressions they match.""" class LatexIncrementalLexer(LatexLexer, ABC): """A very simple incremental lexer for tex/latex code. Roughly follows the state machine described in Tex By Topic, Chapter 2. The generated tokens satisfy: * no newline characters: paragraphs are separated by '\\par' * spaces following control tokens are compressed """ partoken = Token(u"control_word", u"\\par") spacetoken = Token(u"space", u" ") replacetoken = Token(u"chars", u"\ufffd") curlylefttoken = Token(u"chars", u"{") curlyrighttoken = Token(u"chars", u"}") state: str inline_math: bool def reset(self) -> None: super().reset() # three possible states: # newline (N), skipping spaces (S), and middle of line (M) self.state = 'N' # inline math mode? self.inline_math = False def getstate(self) -> Any: # state 'M' is most common, so let that be zero return ( self.raw_buffer, {'M': 0, 'N': 1, 'S': 2}[self.state] | (4 if self.inline_math else 0) ) def setstate(self, state: Any): self.raw_buffer = state[0] self.state = {0: 'M', 1: 'N', 2: 'S'}[state[1] & 3] self.inline_math = bool(state[1] & 4) def get_tokens(self, chars: str, final: bool = False) -> Iterator[Token]: """Yield tokens while maintaining a state. Also skip whitespace after control words and (some) control symbols. Replaces newlines by spaces and \\par commands depending on the context. """ # current position relative to the start of chars in the sequence # of bytes that have been decoded pos = -len(self.raw_buffer.text) for token in self.get_raw_tokens(chars, final=final): pos = pos + len(token.text) assert pos >= 0 # first token includes at least self.raw_buffer if token.name == 'newline': if self.state == 'N': # if state was 'N', generate new paragraph yield self.partoken elif self.state == 'S': # switch to 'N' state, do not generate a space self.state = 'N' elif self.state == 'M': # switch to 'N' state, generate a space self.state = 'N' yield self.spacetoken else: raise AssertionError( "unknown tex state {0!r}".format(self.state)) elif token.name == 'space': if self.state == 'N': # remain in 'N' state, no space token generated pass elif self.state == 'S': # remain in 'S' state, no space token generated pass elif self.state == 'M': # in M mode, generate the space, # but switch to space skip mode self.state = 'S' yield token else: raise AssertionError( "unknown state {0!r}".format(self.state)) elif token.name == 'mathshift': self.inline_math = not self.inline_math self.state = 'M' yield token elif token.name == 'parameter': self.state = 'M' yield token elif token.name == 'control_word': # go to space skip mode self.state = 'S' yield token elif token.name == 'control_symbol': # go to space skip mode self.state = 'S' yield token elif (token.name == 'control_symbol_x' or token.name == 'control_symbol_x2'): # don't skip following space, so go to M mode self.state = 'M' yield token elif token.name == 'comment': # no token is generated # note: comment does not include the newline self.state = 'S' elif token.name == 'chars': self.state = 'M' yield token elif token.name == 'unknown': if self.errors == 'strict': # current position within chars # this is the position right after the unknown token raise UnicodeDecodeError( "latex", # codec chars.encode('utf8'), # problematic input pos - len(token.text), # start of problematic token pos, # end of it "unknown token {0!r}".format(token.text)) elif self.errors == 'ignore': # do nothing pass elif self.errors == 'replace': yield self.replacetoken else: raise NotImplementedError( "error mode {0!r} not supported".format(self.errors)) else: raise AssertionError( "unknown token name {0!r}".format(token.name)) class LatexIncrementalDecoder(LatexIncrementalLexer): """Simple incremental decoder. Transforms lexed LaTeX tokens into unicode. To customize decoding, subclass and override :meth:`get_unicode_tokens`. """ inputenc = "ascii" """Input encoding. **Must** extend ascii.""" def __init__(self, errors: str = 'strict') -> None: super(LatexIncrementalDecoder, self).__init__(errors) self.decoder = codecs.getincrementaldecoder(self.inputenc)(errors) def decode_token(self, token: Token) -> str: """Returns the decoded token text. .. note:: Control words get an extra space added at the back to make sure separation from the next token, so that decoded token sequences can be joined together. For example, the tokens ``'\\hello'`` and ``'world'`` will correctly result in ``'\\hello world'`` (remember that LaTeX eats space following control words). If no space were added, this would wrongfully result in ``'\\helloworld'``. """ text = token.text return text if token.name != u'control_word' else text + u' ' def get_unicode_tokens(self, chars: str, final: bool = False ) -> Iterator[str]: """Decode every token. Override to process the tokens in some other way (for example, for token translation). """ for token in self.get_tokens(chars, final=final): yield self.decode_token(token) def udecode(self, bytes_: str, final: bool = False) -> str: """Decode LaTeX *bytes_* into a unicode string. This implementation calls :meth:`get_unicode_tokens` and joins the resulting unicode strings together. """ return ''.join(self.get_unicode_tokens(bytes_, final=final)) def decode(self, bytes_: bytes, final: bool = False) -> str: """Decode LaTeX *bytes_* into a unicode string. Implementation uses :meth:`udecode`. """ try: chars = self.decoder.decode(bytes_, final=final) except UnicodeDecodeError as e: # API requires that the encode method raises a ValueError # in this case raise ValueError(e) return self.udecode(chars, final) class LatexIncrementalEncoder(codecs.IncrementalEncoder): """Simple incremental encoder for LaTeX. Transforms unicode into :class:`bytes`. To customize decoding, subclass and override :meth:`get_latex_bytes`. """ inputenc = "ascii" """Input encoding. **Must** extend ascii.""" buffer: str def __init__(self, errors: str = 'strict') -> None: """Initialize the codec.""" super().__init__(errors=errors) self.errors = errors self.reset() def reset(self) -> None: """Reset state.""" # buffer for storing last (possibly incomplete) token self.buffer = u"" def getstate(self) -> Any: """Get state.""" return self.buffer def setstate(self, state: Any) -> None: """Set state. The *state* must correspond to the return value of a previous :meth:`getstate` call. """ self.buffer = state def get_unicode_tokens(self, unicode_: str, final: bool = False ) -> Iterator[str]: """Split unicode into tokens so that every token starts with a non-combining character. """ if not isinstance(unicode_, str): raise TypeError( "expected unicode for encode input, but got {0} instead" .format(unicode_.__class__.__name__)) for c in unicode_: if not unicodedata.combining(c): for token in self.flush_unicode_tokens(): yield token self.buffer += c if final: for token in self.flush_unicode_tokens(): yield token def flush_unicode_tokens(self) -> Iterator[str]: """Flush the buffer.""" if self.buffer: yield self.buffer self.buffer = u"" def get_latex_chars(self, unicode_: str, final: bool = False ) -> Iterator[str]: """Encode every character. Override to process the unicode in some other way (for example, for character translation). """ for token in self.get_unicode_tokens(unicode_, final=final): yield token def uencode(self, unicode_: str, final: bool = False) -> str: """Encode the *unicode_* string into LaTeX :class:`bytes`. This implementation calls :meth:`get_latex_chars` and joins the resulting :class:`bytes` together. """ return ''.join(self.get_latex_chars(unicode_, final=final)) def encode(self, unicode_: str, final: bool = False) -> bytes: """Encode the *unicode_* string into LaTeX :class:`bytes`. This implementation calls :meth:`get_latex_chars` and joins the resulting :class:`bytes` together. """ chars = self.uencode(unicode_, final) try: return chars.encode(self.inputenc, self.errors) except UnicodeEncodeError as e: # API requires that the encode method raises a ValueError # in this case raise ValueError(e) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/latexcodec/py.typed0000644005105600024240000000000014572077612016736 0ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/latexcodec/table.txt0000644005105600024240000003644514572077612017115 0ustar00dma0mtURP_dmaA< SPACE \ A- EM SPACE \quad A> THIN SPACE A> ZERO WIDTH JOINER A> ZERO WIDTH NON-JOINER {} A> ZERO WIDTH SPACE \hspace{0pt} A- PERCENT SIGN \% T- EN DASH -- T- EN DASH \textendash T- EM DASH --- T- EM DASH \textemdash A> REPLACEMENT CHARACTER ???? T> LEFT SINGLE QUOTATION MARK ` T> RIGHT SINGLE QUOTATION MARK ' T- LEFT DOUBLE QUOTATION MARK `` T- RIGHT DOUBLE QUOTATION MARK '' T- DOUBLE LOW-9 QUOTATION MARK ,, T< DOUBLE LOW-9 QUOTATION MARK \glqq T- LEFT-POINTING DOUBLE ANGLE QUOTATION MARK \guillemotleft T- RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK \guillemotright T> MODIFIER LETTER PRIME ' T> MODIFIER LETTER DOUBLE PRIME '' T> MODIFIER LETTER TURNED COMMA ` T> MODIFIER LETTER APOSTROPHE ' T> MODIFIER LETTER REVERSED COMMA ` T- DAGGER \dag T- DOUBLE DAGGER \ddag T< REVERSE SOLIDUS \textbackslash M< REVERSE SOLIDUS \backslash M- TILDE OPERATOR \sim T- MODIFIER LETTER LOW TILDE \texttildelow T- SMALL TILDE \~{} T- TILDE \textasciitilde M- BULLET \bullet T- BULLET \textbullet M- ASTERISK OPERATOR \ast T- NUMBER SIGN \# T- LOW LINE \_ T- AMPERSAND \& T- NO-BREAK SPACE ~ T- INVERTED EXCLAMATION MARK !` T- CENT SIGN \not{c} T- POUND SIGN \pounds T- POUND SIGN \textsterling T- YEN SIGN \yen T- YEN SIGN \textyen T- SECTION SIGN \S T- DIAERESIS \"{} T- NOT SIGN \neg T> HYPHEN - T- SOFT HYPHEN \- T- MACRON \={} M- DEGREE SIGN ^\circ T- DEGREE SIGN \textdegree M- MINUS SIGN - M- PLUS-MINUS SIGN \pm T- PLUS-MINUS SIGN \textpm M- SUPERSCRIPT TWO ^2 T- SUPERSCRIPT TWO \texttwosuperior M- SUPERSCRIPT THREE ^3 T- SUPERSCRIPT THREE \textthreesuperior T- ACUTE ACCENT \'{} M- MICRO SIGN \mu T- MICRO SIGN \micro T- PILCROW SIGN \P M- MIDDLE DOT \cdot T- MIDDLE DOT \textperiodcentered T- CEDILLA \c{} M- SUPERSCRIPT ONE ^1 T- SUPERSCRIPT ONE \textonesuperior T- INVERTED QUESTION MARK ?` A- LATIN CAPITAL LETTER A WITH GRAVE \`A A- LATIN CAPITAL LETTER A WITH CIRCUMFLEX \^A A- LATIN CAPITAL LETTER A WITH TILDE \~A A- LATIN CAPITAL LETTER A WITH DIAERESIS \"A A- LATIN CAPITAL LETTER A WITH RING ABOVE \AA A< LATIN CAPITAL LETTER A WITH RING ABOVE \r A A- LATIN CAPITAL LETTER AE \AE A- LATIN CAPITAL LETTER C WITH CEDILLA \c C A- LATIN CAPITAL LETTER E WITH GRAVE \`E A- LATIN CAPITAL LETTER E WITH ACUTE \'E A- LATIN CAPITAL LETTER E WITH CIRCUMFLEX \^E A- LATIN CAPITAL LETTER E WITH DIAERESIS \"E A- LATIN CAPITAL LETTER I WITH GRAVE \`I A- LATIN CAPITAL LETTER I WITH CIRCUMFLEX \^I A- LATIN CAPITAL LETTER I WITH DIAERESIS \"I A- LATIN CAPITAL LETTER N WITH TILDE \~N A- LATIN CAPITAL LETTER O WITH GRAVE \`O A- LATIN CAPITAL LETTER O WITH ACUTE \'O A- LATIN CAPITAL LETTER O WITH CIRCUMFLEX \^O A- LATIN CAPITAL LETTER O WITH TILDE \~O A- LATIN CAPITAL LETTER O WITH DIAERESIS \"O M- MULTIPLICATION SIGN \times A- LATIN CAPITAL LETTER O WITH STROKE \O A- LATIN CAPITAL LETTER U WITH GRAVE \`U A- LATIN CAPITAL LETTER U WITH ACUTE \'U A- LATIN CAPITAL LETTER U WITH CIRCUMFLEX \^U A- LATIN CAPITAL LETTER U WITH DIAERESIS \"U A- LATIN CAPITAL LETTER Y WITH ACUTE \'Y A- LATIN SMALL LETTER SHARP S \ss A- LATIN SMALL LETTER A WITH GRAVE \`a A- LATIN SMALL LETTER A WITH ACUTE \'a A- LATIN SMALL LETTER A WITH CIRCUMFLEX \^a A- LATIN SMALL LETTER A WITH TILDE \~a A- LATIN SMALL LETTER A WITH DIAERESIS \"a A- LATIN SMALL LETTER A WITH RING ABOVE \aa A< LATIN SMALL LETTER A WITH RING ABOVE \r a A- LATIN SMALL LETTER AE \ae A- LATIN SMALL LETTER C WITH CEDILLA \c c A- LATIN SMALL LETTER E WITH GRAVE \`e A- LATIN SMALL LETTER E WITH ACUTE \'e A- LATIN SMALL LETTER E WITH CIRCUMFLEX \^e A- LATIN SMALL LETTER E WITH DIAERESIS \"e A- LATIN SMALL LETTER I WITH GRAVE \`\i A- LATIN SMALL LETTER I WITH GRAVE \`i A- LATIN SMALL LETTER I WITH ACUTE \'\i A- LATIN SMALL LETTER I WITH ACUTE \'i A- LATIN SMALL LETTER I WITH CIRCUMFLEX \^\i A- LATIN SMALL LETTER I WITH CIRCUMFLEX \^i A- LATIN SMALL LETTER I WITH DIAERESIS \"\i A- LATIN SMALL LETTER I WITH DIAERESIS \"i A- LATIN SMALL LETTER N WITH TILDE \~n A- LATIN SMALL LETTER O WITH GRAVE \`o A- LATIN SMALL LETTER O WITH ACUTE \'o A- LATIN SMALL LETTER O WITH CIRCUMFLEX \^o A- LATIN SMALL LETTER O WITH TILDE \~o A- LATIN SMALL LETTER O WITH DIAERESIS \"o M- DIVISION SIGN \div A- LATIN SMALL LETTER O WITH STROKE \o A- LATIN SMALL LETTER U WITH GRAVE \`u A- LATIN SMALL LETTER U WITH ACUTE \'u A- LATIN SMALL LETTER U WITH CIRCUMFLEX \^u A- LATIN SMALL LETTER U WITH DIAERESIS \"u A- LATIN SMALL LETTER Y WITH ACUTE \'y A- LATIN SMALL LETTER Y WITH DIAERESIS \"y A- LATIN CAPITAL LETTER A WITH MACRON \=A A- LATIN SMALL LETTER A WITH MACRON \=a A- LATIN CAPITAL LETTER A WITH BREVE \u A A- LATIN SMALL LETTER A WITH BREVE \u a A- LATIN CAPITAL LETTER A WITH OGONEK \k A A- LATIN SMALL LETTER A WITH OGONEK \k a A- LATIN CAPITAL LETTER C WITH ACUTE \'C A- LATIN SMALL LETTER C WITH ACUTE \'c A- LATIN CAPITAL LETTER C WITH CIRCUMFLEX \^C A- LATIN SMALL LETTER C WITH CIRCUMFLEX \^c A- LATIN CAPITAL LETTER C WITH DOT ABOVE \.C A- LATIN SMALL LETTER C WITH DOT ABOVE \.c A- LATIN CAPITAL LETTER C WITH CARON \v C A- LATIN SMALL LETTER C WITH CARON \v c A- LATIN CAPITAL LETTER D WITH CARON \v D A- LATIN SMALL LETTER D WITH CARON \v d A- LATIN CAPITAL LETTER E WITH MACRON \=E A- LATIN SMALL LETTER E WITH MACRON \=e A- LATIN CAPITAL LETTER E WITH BREVE \u E A- LATIN SMALL LETTER E WITH BREVE \u e A- LATIN CAPITAL LETTER E WITH DOT ABOVE \.E A- LATIN SMALL LETTER E WITH DOT ABOVE \.e A- LATIN CAPITAL LETTER E WITH OGONEK \k E A- LATIN SMALL LETTER E WITH OGONEK \k e A- LATIN CAPITAL LETTER E WITH CARON \v E A- LATIN SMALL LETTER E WITH CARON \v e A- LATIN CAPITAL LETTER G WITH CIRCUMFLEX \^G A- LATIN SMALL LETTER G WITH CIRCUMFLEX \^g A- LATIN CAPITAL LETTER G WITH BREVE \u G A- LATIN SMALL LETTER G WITH BREVE \u g A- LATIN CAPITAL LETTER G WITH DOT ABOVE \.G A- LATIN SMALL LETTER G WITH DOT ABOVE \.g A- LATIN CAPITAL LETTER G WITH CEDILLA \c G A- LATIN SMALL LETTER G WITH CEDILLA \c g A- LATIN CAPITAL LETTER H WITH CIRCUMFLEX \^H A- LATIN SMALL LETTER H WITH CIRCUMFLEX \^h A- LATIN CAPITAL LETTER I WITH TILDE \~I A- LATIN SMALL LETTER I WITH TILDE \~\i A- LATIN SMALL LETTER I WITH TILDE \~i A- LATIN CAPITAL LETTER I WITH MACRON \=I A- LATIN SMALL LETTER I WITH MACRON \=\i A- LATIN SMALL LETTER I WITH MACRON \=i A- LATIN CAPITAL LETTER I WITH BREVE \u I A- LATIN SMALL LETTER I WITH BREVE \u\i A- LATIN SMALL LETTER I WITH BREVE \u i A- LATIN CAPITAL LETTER I WITH OGONEK \k I A- LATIN SMALL LETTER I WITH OGONEK \k i A- LATIN CAPITAL LETTER I WITH DOT ABOVE \.I A- LATIN SMALL LETTER DOTLESS I \i A> LATIN CAPITAL LIGATURE IJ IJ A> LATIN SMALL LIGATURE IJ ij A- LATIN CAPITAL LETTER J WITH CIRCUMFLEX \^J A- LATIN SMALL LETTER J WITH CIRCUMFLEX \^\j A- LATIN SMALL LETTER J WITH CIRCUMFLEX \^j A- LATIN CAPITAL LETTER K WITH CEDILLA \c K A- LATIN SMALL LETTER K WITH CEDILLA \c k A- LATIN CAPITAL LETTER L WITH ACUTE \'L A- LATIN SMALL LETTER L WITH ACUTE \'l A- LATIN CAPITAL LETTER L WITH CEDILLA \c L A- LATIN SMALL LETTER L WITH CEDILLA \c l A- LATIN CAPITAL LETTER L WITH CARON \v L A- LATIN SMALL LETTER L WITH CARON \v l A- LATIN CAPITAL LETTER L WITH STROKE \L A- LATIN SMALL LETTER L WITH STROKE \l A- LATIN CAPITAL LETTER N WITH ACUTE \'N A- LATIN SMALL LETTER N WITH ACUTE \'n A- LATIN CAPITAL LETTER N WITH CEDILLA \c N A- LATIN SMALL LETTER N WITH CEDILLA \c n A- LATIN CAPITAL LETTER N WITH CARON \v N A- LATIN SMALL LETTER N WITH CARON \v n A- LATIN CAPITAL LETTER O WITH MACRON \=O A- LATIN SMALL LETTER O WITH MACRON \=o A- LATIN CAPITAL LETTER O WITH BREVE \u O A- LATIN SMALL LETTER O WITH BREVE \u o A- LATIN CAPITAL LETTER O WITH DOUBLE ACUTE \H O A- LATIN SMALL LETTER O WITH DOUBLE ACUTE \H o A- LATIN CAPITAL LIGATURE OE \OE A- LATIN SMALL LIGATURE OE \oe A- LATIN CAPITAL LETTER R WITH ACUTE \'R A- LATIN SMALL LETTER R WITH ACUTE \'r A- LATIN CAPITAL LETTER R WITH CEDILLA \c R A- LATIN SMALL LETTER R WITH CEDILLA \c r A- LATIN CAPITAL LETTER R WITH CARON \v R A- LATIN SMALL LETTER R WITH CARON \v r A- LATIN CAPITAL LETTER S WITH ACUTE \'S A- LATIN SMALL LETTER S WITH ACUTE \'s A- LATIN CAPITAL LETTER S WITH CIRCUMFLEX \^S A- LATIN SMALL LETTER S WITH CIRCUMFLEX \^s A- LATIN CAPITAL LETTER S WITH CEDILLA \c S A- LATIN SMALL LETTER S WITH CEDILLA \c s A- LATIN CAPITAL LETTER S WITH CARON \v S A- LATIN SMALL LETTER S WITH CARON \v s A- LATIN CAPITAL LETTER T WITH CEDILLA \c T A- LATIN SMALL LETTER T WITH CEDILLA \c t A- LATIN CAPITAL LETTER T WITH CARON \v T A- LATIN SMALL LETTER T WITH CARON \v t A- LATIN CAPITAL LETTER U WITH TILDE \~U A- LATIN SMALL LETTER U WITH TILDE \~u A- LATIN CAPITAL LETTER U WITH MACRON \=U A- LATIN SMALL LETTER U WITH MACRON \=u A- LATIN CAPITAL LETTER U WITH BREVE \u U A- LATIN SMALL LETTER U WITH BREVE \u u A- LATIN CAPITAL LETTER U WITH RING ABOVE \r U A- LATIN SMALL LETTER U WITH RING ABOVE \r u A- LATIN CAPITAL LETTER U WITH DOUBLE ACUTE \H U A- LATIN SMALL LETTER U WITH DOUBLE ACUTE \H u A- LATIN CAPITAL LETTER U WITH OGONEK \k U A- LATIN SMALL LETTER U WITH OGONEK \k u A- LATIN CAPITAL LETTER W WITH CIRCUMFLEX \^W A- LATIN SMALL LETTER W WITH CIRCUMFLEX \^w A- LATIN CAPITAL LETTER Y WITH CIRCUMFLEX \^Y A- LATIN SMALL LETTER Y WITH CIRCUMFLEX \^y A- LATIN CAPITAL LETTER Y WITH DIAERESIS \"Y A- LATIN CAPITAL LETTER Z WITH ACUTE \'Z A- LATIN SMALL LETTER Z WITH ACUTE \'z A- LATIN CAPITAL LETTER Z WITH DOT ABOVE \.Z A- LATIN SMALL LETTER Z WITH DOT ABOVE \.z A- LATIN CAPITAL LETTER Z WITH CARON \v Z A- LATIN SMALL LETTER Z WITH CARON \v z A- LATIN CAPITAL LETTER DZ WITH CARON D\v Z A- LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON D\v z A- LATIN SMALL LETTER DZ WITH CARON d\v z A> LATIN CAPITAL LETTER LJ LJ A> LATIN CAPITAL LETTER L WITH SMALL LETTER J Lj A> LATIN SMALL LETTER LJ lj A> LATIN CAPITAL LETTER NJ NJ A> LATIN CAPITAL LETTER N WITH SMALL LETTER J Nj A> LATIN SMALL LETTER NJ nj A- LATIN CAPITAL LETTER A WITH CARON \v A A- LATIN SMALL LETTER A WITH CARON \v a A- LATIN CAPITAL LETTER I WITH CARON \v I A- LATIN SMALL LETTER I WITH CARON \v\i A- LATIN CAPITAL LETTER O WITH CARON \v O A- LATIN SMALL LETTER O WITH CARON \v o A- LATIN CAPITAL LETTER U WITH CARON \v U A- LATIN SMALL LETTER U WITH CARON \v u A- LATIN CAPITAL LETTER G WITH CARON \v G A- LATIN SMALL LETTER G WITH CARON \v g A- LATIN CAPITAL LETTER K WITH CARON \v K A- LATIN SMALL LETTER K WITH CARON \v k A- LATIN CAPITAL LETTER O WITH OGONEK \k O A- LATIN SMALL LETTER O WITH OGONEK \k o A- LATIN SMALL LETTER J WITH CARON \v\j A> LATIN CAPITAL LETTER DZ DZ A> LATIN CAPITAL LETTER D WITH SMALL LETTER Z Dz A> LATIN SMALL LETTER DZ dz A- LATIN CAPITAL LETTER G WITH ACUTE \'G A- LATIN SMALL LETTER G WITH ACUTE \'g A- LATIN CAPITAL LETTER AE WITH ACUTE \'\AE A- LATIN SMALL LETTER AE WITH ACUTE \'\ae A- LATIN CAPITAL LETTER O WITH STROKE AND ACUTE \'\O A- LATIN SMALL LETTER O WITH STROKE AND ACUTE \'\o A- LATIN CAPITAL LETTER ETH \DH A- LATIN SMALL LETTER ETH \dh A- LATIN CAPITAL LETTER THORN \TH A- LATIN SMALL LETTER THORN \th A- LATIN CAPITAL LETTER D WITH STROKE \DJ A- LATIN SMALL LETTER D WITH STROKE \dj A- LATIN CAPITAL LETTER D WITH DOT BELOW \d D A- LATIN SMALL LETTER D WITH DOT BELOW \d d A- LATIN CAPITAL LETTER L WITH DOT BELOW \d L A- LATIN SMALL LETTER L WITH DOT BELOW \d l A- LATIN CAPITAL LETTER M WITH DOT BELOW \d M A- LATIN SMALL LETTER M WITH DOT BELOW \d m A- LATIN CAPITAL LETTER N WITH DOT BELOW \d N A- LATIN SMALL LETTER N WITH DOT BELOW \d n A- LATIN CAPITAL LETTER R WITH DOT BELOW \d R A- LATIN SMALL LETTER R WITH DOT BELOW \d r A- LATIN CAPITAL LETTER S WITH DOT BELOW \d S A- LATIN SMALL LETTER S WITH DOT BELOW \d s A- LATIN CAPITAL LETTER T WITH DOT BELOW \d T A- LATIN SMALL LETTER T WITH DOT BELOW \d t A- LATIN CAPITAL LETTER S WITH COMMA BELOW \textcommabelow S A- LATIN SMALL LETTER S WITH COMMA BELOW \textcommabelow s A- LATIN CAPITAL LETTER T WITH COMMA BELOW \textcommabelow T A- LATIN SMALL LETTER T WITH COMMA BELOW \textcommabelow t M- SCRIPT SMALL L \ell M- SQUARE ROOT \surd M- INFINITY \infty M- INTEGRAL \int M- INTERSECTION \cap M- UNION \cup M- RIGHTWARDS ARROW \rightarrow M- RIGHTWARDS DOUBLE ARROW \Rightarrow M- LEFTWARDS ARROW \leftarrow M- LEFTWARDS DOUBLE ARROW \Leftarrow M- LOGICAL OR \vee M- LOGICAL AND \wedge M- ALMOST EQUAL TO \approx M- NOT EQUAL TO \neq M- LESS-THAN OR EQUAL TO \leq M- GREATER-THAN OR EQUAL TO \geq M- FOR ALL \forall M- COMPLEMENT \complement M- PARTIAL DIFFERENTIAL \partial M- THERE EXISTS \exists M- THERE DOES NOT EXIST \nexists M- EMPTY SET \emptyset M- NABLA \nabla M- ELEMENT OF \in M- NOT AN ELEMENT OF \notin M- CONTAINS AS MEMBER \ni M- DOES NOT CONTAIN AS MEMBER \notni M- END OF PROOF \blacksquare M- N-ARY PRODUCT \prod M- N-ARY COPRODUCT \coprod M- N-ARY SUMMATION \sum M- SUBSET OF \subset M- SUPERSET OF \supset M- NOT A SUBSET OF \not\subset M- NOT A SUPERSET OF \not\supset M- SUBSET OF OR EQUAL TO \subseteq M- SUPERSET OF OR EQUAL TO \supseteq M- NEITHER A SUBSET OF NOR EQUAL TO \nsubseteq M- NEITHER A SUPERSET OF NOR EQUAL TO \nsupseteq M- SUBSET OF WITH NOT EQUAL TO \subsetneq M- SUPERSET OF WITH NOT EQUAL TO \supsetneq T- MODIFIER LETTER CIRCUMFLEX ACCENT \^{} T- CARON \v{} T- BREVE \u{} T- DOT ABOVE \.{} T- RING ABOVE \r{} T- OGONEK \k{} T- DOUBLE ACUTE ACCENT \H{} A> LATIN SMALL LIGATURE FI fi A> LATIN SMALL LIGATURE FL fl A> LATIN SMALL LIGATURE FF ff A> LATIN SMALL LIGATURE FFI ffi A> LATIN SMALL LIGATURE FFL ffl A> LATIN SMALL LIGATURE ST st M- GREEK SMALL LETTER ALPHA \alpha M- GREEK SMALL LETTER BETA \beta M- GREEK SMALL LETTER GAMMA \gamma M- GREEK SMALL LETTER DELTA \delta M- GREEK SMALL LETTER EPSILON \epsilon M- GREEK SMALL LETTER ZETA \zeta M- GREEK SMALL LETTER ETA \eta M- GREEK SMALL LETTER THETA \theta T< GREEK SMALL LETTER THETA \texttheta M- GREEK SMALL LETTER IOTA \iota M- GREEK SMALL LETTER KAPPA \kappa M- GREEK SMALL LETTER LAMDA \lambda M- GREEK SMALL LETTER MU \mu M- GREEK SMALL LETTER NU \nu M- GREEK SMALL LETTER XI \xi M- GREEK SMALL LETTER OMICRON \omicron M- GREEK SMALL LETTER PI \pi M- GREEK SMALL LETTER RHO \rho M- GREEK SMALL LETTER SIGMA \sigma M- GREEK SMALL LETTER TAU \tau M- GREEK SMALL LETTER UPSILON \upsilon M- GREEK SMALL LETTER PHI \phi M- GREEK PHI SYMBOL \varphi M- GREEK SMALL LETTER CHI \chi M- GREEK SMALL LETTER PSI \psi M- GREEK SMALL LETTER OMEGA \omega M- GREEK CAPITAL LETTER ALPHA \Alpha M- GREEK CAPITAL LETTER BETA \Beta M- GREEK CAPITAL LETTER GAMMA \Gamma M- GREEK CAPITAL LETTER DELTA \Delta M- GREEK CAPITAL LETTER EPSILON \Epsilon M- GREEK CAPITAL LETTER ZETA \Zeta M- GREEK CAPITAL LETTER ETA \Eta M- GREEK CAPITAL LETTER THETA \Theta M- GREEK CAPITAL LETTER IOTA \Iota M- GREEK CAPITAL LETTER KAPPA \Kappa M- GREEK CAPITAL LETTER LAMDA \Lambda M- GREEK CAPITAL LETTER MU \Mu M- GREEK CAPITAL LETTER NU \Nu M- GREEK CAPITAL LETTER XI \Xi M- GREEK CAPITAL LETTER OMICRON \Omicron M- GREEK CAPITAL LETTER PI \Pi M- GREEK CAPITAL LETTER RHO \Rho M- GREEK CAPITAL LETTER SIGMA \Sigma M- GREEK CAPITAL LETTER TAU \Tau M- GREEK CAPITAL LETTER UPSILON \Upsilon M- GREEK CAPITAL LETTER PHI \Phi M- GREEK CAPITAL LETTER CHI \Chi M- GREEK CAPITAL LETTER PSI \Psi M- GREEK CAPITAL LETTER OMEGA \Omega T- COPYRIGHT SIGN \copyright T- COPYRIGHT SIGN \textcopyright T- LATIN CAPITAL LETTER A WITH ACUTE \'A T- LATIN CAPITAL LETTER I WITH ACUTE \'I A- HORIZONTAL ELLIPSIS \ldots M- TRADE MARK SIGN ^{TM} T- TRADE MARK SIGN \texttrademark T- REGISTERED SIGN \textregistered T> LATIN CAPITAL LETTER O WITH OGONEK AND MACRON \textogonekcentered{\=O} T> LATIN SMALL LETTER O WITH OGONEK AND MACRON \textogonekcentered{\=o} M- DOUBLE-STRUCK CAPITAL N \mathbb N M- DOUBLE-STRUCK CAPITAL Z \mathbb Z M- DOUBLE-STRUCK CAPITAL Q \mathbb Q M- DOUBLE-STRUCK CAPITAL R \mathbb R M- DOUBLE-STRUCK CAPITAL C \mathbb C ././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736445.004326 latexcodec-3.0.0/latexcodec.egg-info/0000755005105600024240000000000014572100775016740 5ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736444.0 latexcodec-3.0.0/latexcodec.egg-info/PKG-INFO0000644005105600024240000001136314572100774020040 0ustar00dma0mtURP_dmaMetadata-Version: 2.1 Name: latexcodec Version: 3.0.0 Summary: A lexer and codec to work with LaTeX code in Python. Home-page: https://github.com/mcmtroffaes/latexcodec Download-URL: http://pypi.python.org/pypi/latexcodec Author: Matthias C. M. Troffaes Author-email: matthias.troffaes@gmail.com License: MIT Platform: any Classifier: Development Status :: 5 - Production/Stable Classifier: Environment :: Console Classifier: Intended Audience :: Developers Classifier: License :: OSI Approved :: MIT License Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.7 Classifier: Programming Language :: Python :: 3.8 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: Python :: 3.11 Classifier: Programming Language :: Python :: 3.12 Classifier: Topic :: Text Processing :: Markup :: LaTeX Classifier: Topic :: Text Processing :: Filters Requires-Python: >=3.7 License-File: LICENSE.rst License-File: AUTHORS.rst * **Instead of using latexcodec, I encourage you to consider pylatexenc instead, which is far superior:** https://github.com/phfaist/pylatexenc * Download: http://pypi.python.org/pypi/latexcodec/#downloads * Documentation: http://latexcodec.readthedocs.org/ * Development: http://github.com/mcmtroffaes/latexcodec/ .. |ci| image:: https://github.com/mcmtroffaes/latexcodec/actions/workflows/python-package.yml/badge.svg :target: https://github.com/mcmtroffaes/latexcodec/actions/workflows/python-package.yml :alt: ci .. |codecov| image:: https://codecov.io/gh/mcmtroffaes/latexcodec/branch/develop/graph/badge.svg :target: https://codecov.io/gh/mcmtroffaes/latexcodec :alt: codecov The codec provides a convenient way of going between text written in LaTeX and unicode. Since it is not a LaTeX compiler, it is more appropriate for short chunks of text, such as a paragraph or the values of a BibTeX entry, and it is not appropriate for a full LaTeX document. In particular, its behavior on the LaTeX commands that do not simply select characters is intended to allow the unicode representation to be understandable by a human reader, but is not canonical and may require hand tuning to produce the desired effect. The encoder does a best effort to replace unicode characters outside of the range used as LaTeX input (ascii by default) with a LaTeX command that selects the character. More technically, the unicode code point is replaced by a LaTeX command that selects a glyph that reasonably represents the code point. Unicode characters with special uses in LaTeX are replaced by their LaTeX equivalents. For example, ====================== =================== original text encoded LaTeX ====================== =================== ``¥`` ``\yen`` ``ü`` ``\"u`` ``\N{NO-BREAK SPACE}`` ``~`` ``~`` ``\textasciitilde`` ``%`` ``\%`` ``#`` ``\#`` ``\textbf{x}`` ``\textbf{x}`` ====================== =================== The decoder does a best effort to replace LaTeX commands that select characters with the unicode for the character they are selecting. For example, ===================== ====================== original LaTeX decoded unicode ===================== ====================== ``\yen`` ``¥`` ``\"u`` ``ü`` ``~`` ``\N{NO-BREAK SPACE}`` ``\textasciitilde`` ``~`` ``\%`` ``%`` ``\#`` ``#`` ``\textbf{x}`` ``\textbf {x}`` ``#`` ``#`` ===================== ====================== In addition, comments are dropped (including the final newline that marks the end of a comment), paragraphs are canonicalized into double newlines, and other newlines are left as is. Spacing after LaTeX commands is also canonicalized. For example, :: hi % bye there\par world \textbf {awesome} is decoded as :: hi there world \textbf {awesome} When decoding, LaTeX commands not directly selecting characters (for example, macros and formatting commands) are passed through unchanged. The same happens for LaTeX commands that select characters but are not yet recognized by the codec. Either case can result in a hybrid unicode string in which some characters are understood as literally the character and others as parts of unexpanded commands. Consequently, at times, backslashes will be left intact for denoting the start of a potentially unrecognized control sequence. Given the numerous and changing packages providing such LaTeX commands, the codec will never be complete, and new translations of unrecognized unicode or unrecognized LaTeX symbols are always welcome. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736444.0 latexcodec-3.0.0/latexcodec.egg-info/SOURCES.txt0000644005105600024240000000120214572100774020616 0ustar00dma0mtURP_dma.coveragerc AUTHORS.rst CHANGELOG.rst INSTALL.rst LICENSE.rst MANIFEST.in README.rst VERSION mypy.ini setup.py doc/Makefile doc/api.rst doc/authors.rst doc/changes.rst doc/conf.py doc/index.rst doc/license.rst doc/make.bat doc/quickstart.rst doc/api/codec.rst doc/api/lexer.rst latexcodec/__init__.py latexcodec/codec.py latexcodec/lexer.py latexcodec/py.typed latexcodec/table.txt latexcodec.egg-info/PKG-INFO latexcodec.egg-info/SOURCES.txt latexcodec.egg-info/dependency_links.txt latexcodec.egg-info/top_level.txt latexcodec.egg-info/zip-safe test/conftest.py test/test_install_example.py test/test_latex_codec.py test/test_latex_lexer.py././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736444.0 latexcodec-3.0.0/latexcodec.egg-info/dependency_links.txt0000644005105600024240000000000114572100774023005 0ustar00dma0mtURP_dma ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736444.0 latexcodec-3.0.0/latexcodec.egg-info/top_level.txt0000644005105600024240000000001314572100774021463 0ustar00dma0mtURP_dmalatexcodec ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736444.0 latexcodec-3.0.0/latexcodec.egg-info/zip-safe0000644005105600024240000000000114572100774020367 0ustar00dma0mtURP_dma ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/mypy.ini0000644005105600024240000000020014572077612014625 0ustar00dma0mtURP_dma[mypy] files = latexcodec/**/*.py,test/*.py,setup.py check_untyped_defs = True [mypy-setuptools] ignore_missing_imports = True ././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736445.077319 latexcodec-3.0.0/setup.cfg0000644005105600024240000000004614572100775014754 0ustar00dma0mtURP_dma[egg_info] tag_build = tag_date = 0 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709736177.0 latexcodec-3.0.0/setup.py0000644005105600024240000000300014572100361014625 0ustar00dma0mtURP_dma# -*- coding: utf-8 -*- import io from setuptools import setup, find_packages def readfile(filename): with io.open(filename, encoding="utf-8") as stream: return stream.read().split("\n") readme = readfile("README.rst")[5:] # skip title and badges version = readfile("VERSION")[0].strip() setup( name='latexcodec', version=version, url='https://github.com/mcmtroffaes/latexcodec', download_url='http://pypi.python.org/pypi/latexcodec', license='MIT', author='Matthias C. M. Troffaes', author_email='matthias.troffaes@gmail.com', description=readme[0], long_description="\n".join(readme[2:]), zip_safe=True, classifiers=[ 'Development Status :: 5 - Production/Stable', 'Environment :: Console', 'Intended Audience :: Developers', 'License :: OSI Approved :: MIT License', 'Operating System :: OS Independent', 'Programming Language :: Python', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.7', 'Programming Language :: Python :: 3.8', 'Programming Language :: Python :: 3.9', 'Programming Language :: Python :: 3.10', 'Programming Language :: Python :: 3.11', 'Programming Language :: Python :: 3.12', 'Topic :: Text Processing :: Markup :: LaTeX', 'Topic :: Text Processing :: Filters', ], platforms='any', packages=find_packages(), package_data={'latexcodec': ['table.txt']}, python_requires='>=3.7', ) ././@PaxHeader0000000000000000000000000000003300000000000010211 xustar0027 mtime=1709736445.060318 latexcodec-3.0.0/test/0000755005105600024240000000000014572100775014112 5ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/test/conftest.py0000644005105600024240000000000014572077612016302 0ustar00dma0mtURP_dma././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/test/test_install_example.py0000644005105600024240000000253214572077612020711 0ustar00dma0mtURP_dmadef test_install_example_1(): import latexcodec # noqa text_latex = b"\\'el\\`eve" assert text_latex.decode("latex") == u"élève" text_unicode = u"ångström" assert text_unicode.encode("latex") == b'\\aa ngstr\\"om' def test_install_example_2(): import codecs import latexcodec # noqa text_latex = u"\\'el\\`eve" assert codecs.decode(text_latex, "ulatex") == u"élève" # type: ignore text_unicode = u"ångström" assert codecs.encode(text_unicode, "ulatex") == '\\aa ngstr\\"om' def test_install_example_3(): import latexcodec # noqa text_latex = b"\xfe" assert text_latex.decode("latex+latin1") == u"þ" assert text_latex.decode("latex+latin2") == u"ţ" text_unicode = u"ţ" assert text_unicode.encode("latex+latin1") == b'\\c t' # ţ is not latin1 assert text_unicode.encode("latex+latin2") == b'\xfe' # but it is latin2 def test_install_example_4(): import codecs import latexcodec # noqa text_unicode = '⌨' # \u2328 = keyboard symbol, currently not translated try: # raises a value error as \u2328 cannot be encoded into latex codecs.encode(text_unicode, "ulatex+ascii") except ValueError: pass assert codecs.encode(text_unicode, "ulatex+ascii", "keep") == '⌨' assert codecs.encode(text_unicode, "ulatex+utf8") == '⌨' ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/test/test_latex_codec.py0000644005105600024240000004172414572077612020010 0ustar00dma0mtURP_dma# -*- coding: utf-8 -*- """Tests for the latex codec.""" from __future__ import print_function import codecs from io import BytesIO import pytest from unittest import TestCase import latexcodec def test_getregentry(): assert latexcodec.codec.getregentry() is not None def test_find_latex(): assert latexcodec.codec.find_latex('hello') is None def test_latex_incremental_decoder_getstate(): encoder = codecs.getincrementaldecoder('latex')() with pytest.raises(NotImplementedError): encoder.getstate() def test_latex_incremental_decoder_setstate(): encoder = codecs.getincrementaldecoder('latex')() state = (b'', 0) with pytest.raises(NotImplementedError): encoder.setstate(state) def split_input(input_): """Helper function for testing the incremental encoder and decoder.""" assert isinstance(input_, (str, bytes)) if input_: for i in range(len(input_)): if i + 1 < len(input_): yield input_[i:i + 1], False else: yield input_[i:i + 1], True else: yield input_, True class TestDecoder(TestCase): """Stateless decoder tests.""" maxDiff = None def decode(self, text_utf8, text_latex, inputenc=None): """Main test function.""" encoding = 'latex+' + inputenc if inputenc else 'latex' decoded, n = codecs.getdecoder(encoding)(text_latex) self.assertEqual((decoded, n), (text_utf8, len(text_latex))) def test_invalid_type(self): with pytest.raises(TypeError): codecs.getdecoder("latex")(object()) # type: ignore def test_invalid_code(self): with pytest.raises(ValueError): # b'\xe9' is invalid utf-8 code self.decode('', b'\xe9 ', 'utf-8') def test_null(self): self.decode('', b'') def test_maelstrom(self): self.decode(u"mælström", br'm\ae lstr\"om') def test_maelstrom_latin1(self): self.decode(u"mælström", b'm\\ae lstr\xf6m', 'latin1') def test_laren(self): self.decode( u"© låren av björn", br'\copyright\ l\aa ren av bj\"orn') def test_laren_brackets(self): self.decode( u"© l{å}ren av bj{ö}rn", br'\copyright\ l{\aa}ren av bj{\"o}rn') def test_laren_latin1(self): self.decode( u"© låren av björn", b'\\copyright\\ l\xe5ren av bj\xf6rn', 'latin1') def test_droitcivil(self): self.decode( u"Même s'il a fait l'objet d'adaptations suite à l'évolution, " u"la transformation sociale, économique et politique du pays, " u"le code civil fran{ç}ais est aujourd'hui encore le texte " u"fondateur " u"du droit civil français mais aussi du droit civil belge " u"ainsi que " u"de plusieurs autres droits civils.", b"M\\^eme s'il a fait l'objet d'adaptations suite " b"\\`a l'\\'evolution, \nla transformation sociale, " b"\\'economique et politique du pays, \nle code civil " b"fran\\c{c}ais est aujourd'hui encore le texte fondateur \n" b"du droit civil fran\\c cais mais aussi du droit civil " b"belge ainsi que \nde plusieurs autres droits civils.", ) def test_oeuf(self): self.decode( u"D'un point de vue diététique, l'œuf apaise la faim.", br"D'un point de vue di\'et\'etique, l'\oe uf apaise la faim.", ) def test_oeuf_latin1(self): self.decode( u"D'un point de vue diététique, l'œuf apaise la faim.", b"D'un point de vue di\xe9t\xe9tique, l'\\oe uf apaise la faim.", 'latin1' ) def test_alpha(self): self.decode(u"α", b"$\\alpha$") def test_maelstrom_multibyte_encoding(self): self.decode(u"\\c öké", b'\\c \xc3\xb6k\xc3\xa9', 'utf8') def test_serafin(self): self.decode(u"Seraf{\xed}n", b"Seraf{\\'i}n") def test_astrom(self): self.decode(u"{\xc5}str{\xf6}m", b'{\\AA}str{\\"o}m') def test_space_1(self): self.decode(u"ææ", br'\ae \ae') def test_space_2(self): self.decode(u"æ æ", br'\ae\ \ae') def test_space_3(self): self.decode(u"æ æ", br'\ae \quad \ae') def test_number_sign_1(self): self.decode(u"# hello", br'\#\ hello') def test_number_sign_2(self): # LaTeX does not absorb the space following '\#': # check decoding is correct self.decode(u"# hello", br'\# hello') def test_number_sign_3(self): # a single '#' is not valid LaTeX: # for the moment we ignore this error and return # unchanged self.decode(u"# hello", br'# hello') def test_underscore(self): self.decode(u"_", br'\_') def test_dz(self): self.decode(u"DZ", br'DZ') def test_newline(self): self.decode(u"hello world", b"hello\nworld") def test_par1(self): self.decode(u"hello\n\nworld", b"hello\n\nworld") def test_par2(self): self.decode(u"hello\n\nworld", b"hello\\par world") def test_par3(self): self.decode(u"hello\n\nworld", b"hello \\par world") def test_ogonek1(self): self.decode(u"ĄąĘęĮįǪǫŲų", br'\k A\k a\k E\k e\k I\k i\k O\k o\k U\k u') def test_ogonek2(self): # note: should decode into u"Ǭǭ" but can't support this yet... self.decode(u"\\textogonekcentered {Ō}\\textogonekcentered {ō}", br'\textogonekcentered{\=O}\textogonekcentered{\=o}') def test_math_spacing_dollar(self): self.decode('This is a ψ test.', br'This is a $\psi$ test.') def test_math_spacing_brace(self): self.decode('This is a ψ test.', br'This is a \(\psi\) test.') def test_double_math(self): # currently no attempt to translate maths inside $$ self.decode('This is a $$\\psi $$ test.', br'This is a $$\psi$$ test.') def test_tilde(self): self.decode('This is a ˜, ˷, ∼ and ~test.', (br'This is a \~{}, \texttildelow, ' br'$\sim$ and \textasciitilde test.')) def test_backslash(self): self.decode('This is a \\ \\test.', br'This is a $\backslash$ \textbackslash test.') def test_percent(self): self.decode('This is a % test.', br'This is a \% test.') def test_math_minus(self): self.decode('This is a − test.', br'This is a $-$ test.') def test_swedish_again(self): self.decode( u"l{å}ren l{Å}ren", br'l{\r a}ren l{\r A}ren') def test_double_quotes(self): self.decode(u"“a+b”", br"``a+b''") def test_double_quotes_unicode(self): self.decode(u"“á”", u"``á''".encode("utf8"), "utf8") def test_double_quotes_gb2312(self): self.decode(u"“你好”", u"``你好''".encode('gb2312'), 'gb2312') def test_ell(self): self.decode(u"ℓ", br"$\ell$") def test_theta(self): self.decode(u"θ", br"$\theta$") self.decode(u"θ", br"\texttheta") def test_decode_comment(self): self.decode(u"\\\\", br"\\%") self.decode(u"% abc \\\\\\\\% ghi", b"\\% abc\n\\\\% def\n\\\\\\% ghi") def test_decode_lower_quotes(self): self.decode(u"„", br",,") self.decode(u"„", br"\glqq") def test_decode_guillemet(self): self.decode(u"«quote»", br"\guillemotleft quote\guillemotright") def test_decode_reals(self): self.decode(u"ℝ", br"$\mathbb R$") self.decode(u"ℝ", br"$\mathbb{R}$") class TestStreamDecoder(TestDecoder): """Stream decoder tests.""" def decode(self, text_utf8, text_latex, inputenc=None): encoding = 'latex+' + inputenc if inputenc else 'latex' stream = BytesIO(text_latex) reader = codecs.getreader(encoding)(stream) self.assertEqual(text_utf8, reader.read()) # in this test, BytesIO(object()) is eventually called def test_invalid_type(self): TestDecoder.test_invalid_type(self) class TestIncrementalDecoder(TestDecoder): """Incremental decoder tests.""" def decode(self, text_utf8, text_latex, inputenc=None): encoding = 'latex+' + inputenc if inputenc else 'latex' decoder = codecs.getincrementaldecoder(encoding)() decoded_parts = ( decoder.decode(text_latex_part, final) for text_latex_part, final in split_input(text_latex)) self.assertEqual(text_utf8, ''.join(decoded_parts)) class TestEncoder(TestCase): """Stateless encoder tests.""" def encode(self, text_utf8, text_latex, inputenc=None, errors='strict'): """Main test function.""" encoding = 'latex+' + inputenc if inputenc else 'latex' encoded, n = codecs.getencoder(encoding)(text_utf8, errors=errors) self.assertEqual((encoded, n), (text_latex, len(text_utf8))) def test_invalid_type(self): with pytest.raises(TypeError): codecs.getencoder("latex")(object()) # type: ignore # note concerning test_invalid_code_* methods: # '\u2328' (0x2328 = 9000) is unicode for keyboard symbol # we currently provide no translation for this into LaTeX code def test_invalid_code_strict(self): with pytest.raises(ValueError): self.encode('\u2328', b'', 'ascii', 'strict') def test_invalid_code_ignore(self): self.encode('\u2328', b'', 'ascii', 'ignore') def test_invalid_code_replace(self): self.encode('\u2328', b'{\\char9000}', 'ascii', 'replace') def test_invalid_code_baderror(self): with pytest.raises(ValueError): self.encode('\u2328', b'', 'ascii', '**baderror**') def test_null(self): self.encode('', b'') def test_maelstrom(self): self.encode(u"mælström", br'm\ae lstr\"om') def test_maelstrom_latin1(self): self.encode(u"mælström", b'm\xe6lstr\xf6m', 'latin1') def test_laren(self): self.encode( u"© låren av björn", br'\copyright\ l\aa ren av bj\"orn') def test_laren_latin1(self): self.encode( u"© låren av björn", b'\xa9 l\xe5ren av bj\xf6rn', 'latin1') def test_droitcivil(self): self.encode( u"Même s'il a fait l'objet d'adaptations suite à l'évolution, \n" u"la transformation sociale, économique et politique du pays, \n" u"le code civil fran{ç}ais est aujourd'hui encore le texte " u"fondateur \n" u"du droit civil français mais aussi du droit civil belge " u"ainsi que \n" u"de plusieurs autres droits civils.", b"M\\^eme s'il a fait l'objet d'adaptations suite " b"\\`a l'\\'evolution, \nla transformation sociale, " b"\\'economique et politique du pays, \nle code civil " b"fran{\\c c}ais est aujourd'hui encore le texte fondateur \n" b"du droit civil fran\\c cais mais aussi du droit civil " b"belge ainsi que \nde plusieurs autres droits civils.", ) def test_oeuf(self): self.encode( u"D'un point de vue diététique, l'œuf apaise la faim.", br"D'un point de vue di\'et\'etique, l'\oe uf apaise la faim.", ) def test_oeuf_latin1(self): self.encode( u"D'un point de vue diététique, l'œuf apaise la faim.", b"D'un point de vue di\xe9t\xe9tique, l'\\oe uf apaise la faim.", 'latin1' ) def test_alpha(self): self.encode(u"α", b"$\\alpha$") def test_serafin(self): self.encode(u"Seraf{\xed}n", b"Seraf{\\'\\i }n") def test_space_1(self): self.encode(u"ææ", br'\ae \ae') def test_space_2(self): self.encode(u"æ æ", br'\ae\ \ae') def test_space_3(self): self.encode(u"æ æ", br'\ae \quad \ae') def test_number_sign(self): # note: no need for control space after \# self.encode(u"# hello", br'\# hello') def test_underscore(self): self.encode(u"_", br'\_') def test_dz1(self): self.encode(u"DZ", br'DZ') def test_dz2(self): self.encode(u"DZ", br'DZ') def test_newline(self): self.encode(u"hello\nworld", b"hello\nworld") def test_par1(self): self.encode(u"hello\n\nworld", b"hello\n\nworld") def test_par2(self): self.encode(u"hello\\par world", b"hello\\par world") def test_ogonek1(self): self.encode(u"ĄąĘęĮįǪǫŲų", br'\k A\k a\k E\k e\k I\k i\k O\k o\k U\k u') def test_ogonek2(self): self.encode(u"Ǭǭ", br'\textogonekcentered{\=O}\textogonekcentered{\=o}') def test_math_spacing(self): self.encode('This is a ψ test.', br'This is a $\psi$ test.') def test_double_math(self): # currently no attempt to translate maths inside $$ self.encode('This is a $$\\psi$$ test.', br'This is a $$\psi$$ test.') def test_tilde(self): self.encode('This is a ˜, ˷, ∼ and ~test.', (br'This is a \~{}, \texttildelow , ' br'$\sim$ and \textasciitilde test.')) def test_percent(self): self.encode('This is a % test.', br'This is a \% test.') def test_hyphen(self): self.encode('This is a \N{HYPHEN} test.', br'This is a - test.') def test_math_minus(self): self.encode(u'This is a − test.', br'This is a $-$ test.') def test_double_quotes(self): self.encode(u"“a+b”", br"``a+b''") def test_double_quotes_unicode(self): self.encode(u"“á”", br"``\'a''") def test_thin_space(self): self.encode(u"a\u2009b", b"a b") def test_ell(self): self.encode(u"ℓ", br"$\ell$") def test_theta(self): self.encode(u"θ", br"$\theta$") def test_encode_lower_quotes(self): self.encode(u"„", br",,") def test_encode_guillemet(self): self.encode(u"«quote»", br"\guillemotleft quote\guillemotright") def test_encode_reals(self): self.encode(u"ℝ", br"$\mathbb R$") def test_encode_ligatures(self): self.encode(u"ff fi fl ffi ffl st", br"ff fi fl ffi ffl st") def test_encode_zero_width(self): self.encode(u"1\u200b2\u200c3\u200d4", br"1\hspace{0pt}2{}34") class TestStreamEncoder(TestEncoder): """Stream encoder tests.""" def encode(self, text_utf8, text_latex, inputenc=None, errors='strict'): encoding = 'latex+' + inputenc if inputenc else 'latex' stream = BytesIO() writer = codecs.getwriter(encoding)(stream, errors=errors) writer.write(text_utf8) self.assertEqual(text_latex, stream.getvalue()) class TestIncrementalEncoder(TestEncoder): """Incremental encoder tests.""" def encode(self, text_utf8, text_latex, inputenc=None, errors='strict'): encoding = 'latex+' + inputenc if inputenc else 'latex' encoder = codecs.getincrementalencoder(encoding)(errors=errors) encoded_parts = ( encoder.encode(text_utf8_part, final) for text_utf8_part, final in split_input(text_utf8)) self.assertEqual(text_latex, b''.join(encoded_parts)) class TestUnicodeDecoder(TestDecoder): def decode(self, text_utf8, text_latex, inputenc=None): """Main test function.""" text_latex = text_latex.decode(inputenc if inputenc else "ascii") decoded, n = codecs.getdecoder('ulatex')(text_latex) self.assertEqual((decoded, n), (text_utf8, len(text_latex))) class TestUnicodeEncoder(TestEncoder): def encode(self, text_utf8, text_latex, inputenc=None, errors='strict'): """Main test function.""" encoding = 'ulatex+' + inputenc if inputenc else 'ulatex' text_latex = text_latex.decode(inputenc if inputenc else 'ascii') encoded, n = codecs.getencoder(encoding)(text_utf8, errors=errors) self.assertEqual((encoded, n), (text_latex, len(text_utf8))) def uencode(self, text_utf8, text_ulatex, inputenc=None, errors='strict'): """Main test function.""" encoding = 'ulatex+' + inputenc if inputenc else 'ulatex' encoded, n = codecs.getencoder(encoding)(text_utf8, errors=errors) self.assertEqual((encoded, n), (text_ulatex, len(text_utf8))) def test_ulatex_ascii(self): self.uencode(u'# ψ', u'\\# $\\psi$', 'ascii') def test_ulatex_utf8(self): self.uencode(u'# ψ', u'\\# ψ', 'utf8') # the following tests rely on the fact that \u2328 is not in our # translation table def test_ulatex_ascii_invalid(self): with pytest.raises(ValueError): self.uencode(u'# \u2328', u'', 'ascii') def test_ulatex_utf8_invalid(self): self.uencode(u'# ψ \u2328', u'\\# ψ \u2328', 'utf8') def test_invalid_code_keep(self): self.uencode(u'# ψ \u2328', u'\\# $\\psi$ \u2328', 'ascii', 'keep') ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1709735818.0 latexcodec-3.0.0/test/test_latex_lexer.py0000644005105600024240000003415014572077612020045 0ustar00dma0mtURP_dma# -*- coding: utf-8 -*- """Tests for the tex lexer.""" from typing import Type, TypeVar, Generic, List, Iterator import pytest from unittest import TestCase from latexcodec.lexer import ( LatexLexer, LatexIncrementalLexer, LatexIncrementalDecoder, LatexIncrementalEncoder, Token) class MockLexer(LatexLexer): tokens = [ ('chars', 'mock'), ('unknown', '.'), ] class MockIncrementalDecoder(LatexIncrementalDecoder): tokens = [ ('chars', 'mock'), ('unknown', '.'), ] def test_token_create_with_args(): t = Token('hello', 'world') assert t.name == 'hello' assert t.text == 'world' def test_token_assign_name(): with pytest.raises(AttributeError): t = Token('hello', 'world') t.name = 'test' # type: ignore def test_token_assign_text(): with pytest.raises(AttributeError): t = Token('hello', 'world') t.text = 'test' # type: ignore def test_token_assign_other(): with pytest.raises(AttributeError): t = Token('hello', 'world') t.blabla = 'test' # type: ignore class BaseLatexLexerTest(TestCase): errors = 'strict' Lexer: Type[LatexLexer] def setUp(self): self.lexer = self.Lexer(errors=self.errors) def lex_it(self, latex_code, latex_tokens, final=False): tokens = self.lexer.get_raw_tokens(latex_code, final=final) self.assertEqual( list(token.text for token in tokens), latex_tokens) def tearDown(self): del self.lexer class LatexLexerTest(BaseLatexLexerTest): Lexer = LatexLexer def test_null(self): self.lex_it('', [], final=True) def test_hello(self): self.lex_it( 'hello! [#1] This \\is\\ \\^ a \ntest.\n' ' \nHey.\n\n\\# x \\#x', r'h|e|l|l|o|!| | |[|#1|]| |T|h|i|s| |\is|\ | | |\^| |a| ' '|\n|t|e|s|t|.|\n| | | | |\n|H|e|y|.|\n|\n' r'|\#| |x| |\#|x'.split('|'), final=True ) def test_comment(self): self.lex_it( 'test% some comment\ntest', 't|e|s|t|% some comment|\n|t|e|s|t'.split('|'), final=True ) def test_comment_newline(self): self.lex_it( 'test% some comment\n\ntest', 't|e|s|t|% some comment|\n|\n|t|e|s|t'.split('|'), final=True ) def test_control(self): self.lex_it( '\\hello\\world', '\\hello|\\world'.split('|'), final=True ) def test_control_whitespace(self): self.lex_it( '\\hello \\world ', '\\hello| | | |\\world| | | '.split('|'), final=True ) def test_controlx(self): self.lex_it( '\\#\\&', '\\#|\\&'.split('|'), final=True ) def test_controlx_whitespace(self): self.lex_it( '\\# \\& ', '\\#| | | | |\\&| | | '.split('|'), final=True ) def test_buffer(self): self.lex_it( 'hi\\t', 'h|i'.split('|'), ) self.lex_it( 'here', ['\\there'], final=True, ) def test_state(self): self.lex_it( 'hi\\t', 'h|i'.split('|'), ) state = self.lexer.getstate() self.lexer.reset() self.lex_it( 'here', 'h|e|r|e'.split('|'), final=True, ) self.lexer.setstate(state) self.lex_it( 'here', ['\\there'], final=True, ) def test_decode(self): with pytest.raises(NotImplementedError): self.lexer.decode(b'') def test_final_backslash(self): self.lex_it( 'notsogood\\', 'n|o|t|s|o|g|o|o|d|\\'.split('|'), final=True ) def test_final_comment(self): self.lex_it( u'hello%', u'h|e|l|l|o|%'.split(u'|'), final=True ) def test_hash(self): self.lex_it(u'#', [u'#'], final=True) def test_tab(self): self.lex_it(u'\\c\tc', u'\\c|\t|c'.split(u'|'), final=True) def test_percent(self): self.lex_it(u'This is a \\% test.', u'T|h|i|s| |i|s| |a| |\\%| |t|e|s|t|.'.split(u'|'), final=True) self.lex_it(u'\\% %test', u'\\%| |%test'.split(u'|'), final=True) self.lex_it(u'\\% %test\nhi', u'\\%| |%test|\n|h|i'.split(u'|'), final=True) def test_double_quotes(self): self.lex_it(u"``a+b''", u"``|a|+|b|''".split(u'|'), final=True) T = TypeVar('T') class BaseLatexIncrementalDecoderTest(TestCase, Generic[T]): """Tex lexer fixture.""" errors = 'strict' IncrementalDecoder: Type[LatexIncrementalDecoder] def setUp(self): self.lexer = self.IncrementalDecoder(self.errors) def decode(self, input_: T, final: bool = False) -> str: raise NotImplementedError def lex_it(self, chars: str, latex_tokens: List[str], final: bool = False): tokens = self.lexer.get_tokens(chars, final=final) self.assertEqual( list(token.text for token in tokens), latex_tokens) def tearDown(self): del self.lexer class LatexIncrementalDecoderTest(BaseLatexIncrementalDecoderTest): IncrementalDecoder = LatexIncrementalDecoder def decode(self, input_: bytes, final: bool = False) -> str: return self.lexer.decode(input_, final) def test_null(self): self.lex_it(u'', [], final=True) def test_hello(self): self.lex_it( u'hello! [#1] This \\is\\ \\^ a \ntest.\n' u' \nHey.\n\n\\# x \\#x', r'h|e|l|l|o|!| |[|#1|]| |T|h|i|s| |\is|\ |\^|a| ' r'|t|e|s|t|.| |\par|H|e|y|.| ' r'|\par|\#| |x| |\#|x'.split(u'|'), final=True ) def test_comment(self): self.lex_it( u'test% some comment\ntest', u't|e|s|t|t|e|s|t'.split(u'|'), final=True ) def test_comment_newline(self): self.lex_it( u'test% some comment\n\ntest', u't|e|s|t|\\par|t|e|s|t'.split(u'|'), final=True ) def test_control(self): self.lex_it( u'\\hello\\world', u'\\hello|\\world'.split(u'|'), final=True ) def test_control_whitespace(self): self.lex_it( u'\\hello \\world ', u'\\hello|\\world'.split(u'|'), final=True ) def test_controlx(self): self.lex_it( u'\\#\\&', u'\\#|\\&'.split(u'|'), final=True ) def test_controlx_whitespace(self): self.lex_it( u'\\# \\& ', u'\\#| |\\&| '.split(u'|'), final=True ) def test_buffer(self): self.lex_it( u'hi\\t', u'h|i'.split(u'|'), ) self.lex_it( u'here', [u'\\there'], final=True, ) def test_buffer_decode(self): self.assertEqual( self.decode(b'hello! [#1] This \\i'), u'hello! [#1] This ', ) self.assertEqual( self.decode(b's\\ \\^ a \ntest.\n'), u'\\is \\ \\^a test.', ) self.assertEqual( self.decode(b' \nHey.\n\n\\# x \\#x', final=True), u' \\par Hey. \\par \\# x \\#x', ) def test_state_middle(self): self.lex_it( u'hi\\t', u'h|i'.split(u'|'), ) state = self.lexer.getstate() self.assertEqual(self.lexer.state, 'M') self.assertEqual(self.lexer.raw_buffer.name, 'control_word') self.assertEqual(self.lexer.raw_buffer.text, u'\\t') self.lexer.reset() self.assertEqual(self.lexer.state, 'N') self.assertEqual(self.lexer.raw_buffer.name, 'unknown') self.assertEqual(self.lexer.raw_buffer.text, u'') self.lex_it( u'here', u'h|e|r|e'.split(u'|'), final=True, ) self.lexer.setstate(state) self.assertEqual(self.lexer.state, 'M') self.assertEqual(self.lexer.raw_buffer.name, 'control_word') self.assertEqual(self.lexer.raw_buffer.text, u'\\t') self.lex_it( u'here', [u'\\there'], final=True, ) def test_state_inline_math(self): self.lex_it( u'hi$t', u'h|i|$'.split(u'|'), ) assert self.lexer.inline_math self.lex_it( u'here$', u't|h|e|r|e|$'.split(u'|'), final=True, ) assert not self.lexer.inline_math # counterintuitive? def test_final_backslash(self): with pytest.raises(UnicodeDecodeError): self.lex_it( u'notsogood\\', [u'notsogood'], final=True ) def test_final_comment(self): self.lex_it( u'hello%', u'h|e|l|l|o'.split(u'|'), final=True ) def test_hash(self): self.lex_it(u'#', [u'#'], final=True) def test_tab(self): self.lex_it(u'\\c\tc', u'\\c|c'.split(u'|'), final=True) class UnicodeLatexIncrementalDecoderTest(LatexIncrementalDecoderTest): def decode(self, input_: bytes, final: bool = False) -> str: return self.lexer.udecode(input_.decode("ascii"), final) class LatexIncrementalDecoderReplaceTest(BaseLatexIncrementalDecoderTest): errors = 'replace' IncrementalDecoder = MockIncrementalDecoder def test_errors_replace(self): self.lex_it( u'helmocklo', u'\ufffd|\ufffd|\ufffd|mock|\ufffd|\ufffd'.split(u'|'), final=True ) class LatexIncrementalDecoderIgnoreTest(BaseLatexIncrementalDecoderTest): errors = 'ignore' IncrementalDecoder = MockIncrementalDecoder def test_errors_ignore(self): self.lex_it( u'helmocklo', u'mock'.split(u'|'), final=True ) class LatexIncrementalDecoderInvalidErrorTest(BaseLatexIncrementalDecoderTest): errors = '**baderror**' IncrementalDecoder = MockIncrementalDecoder def test_errors_invalid(self): with pytest.raises(NotImplementedError): self.lex_it( u'helmocklo', u'?|?|?|mock|?|?'.split(u'|'), final=True ) class InvalidTokenLatexIncrementalDecoder(LatexIncrementalDecoder): """Decoder which results in invalid tokens.""" def get_raw_tokens(self, chars: str, final: bool = False ) -> Iterator[Token]: return iter([Token('**invalid**', chars)]) def test_invalid_token(): lexer = InvalidTokenLatexIncrementalDecoder() with pytest.raises(AssertionError): lexer.decode(b'hello') def test_invalid_state_1(): lexer = LatexIncrementalDecoder() # piggyback invalid state lexer.state = '**invalid**' with pytest.raises(AssertionError): lexer.decode(b'\n\n\n') def test_invalid_state_2(): lexer = LatexIncrementalDecoder() # piggyback invalid state lexer.state = '**invalid**' with pytest.raises(AssertionError): lexer.decode(b' ') class MyLatexIncrementalLexer(LatexIncrementalLexer): """A mock decoder to test the lexer.""" def decode(self, input_: bytes, final: bool = False) -> str: return '' # pragma: no cover class LatexIncrementalLexerTest(TestCase): errors = 'strict' def setUp(self): self.lexer = MyLatexIncrementalLexer(errors=self.errors) def lex_it(self, latex_code, latex_tokens, final=False): tokens = self.lexer.get_tokens(latex_code, final=final) self.assertEqual( list(token.text for token in tokens), latex_tokens) def tearDown(self): del self.lexer def test_newline(self): self.lex_it( u"hello\nworld", u"h|e|l|l|o| |w|o|r|l|d".split(u'|'), final=True) def test_par(self): self.lex_it( u"hello\n\nworld", u"h|e|l|l|o| |\\par|w|o|r|l|d".split(u'|'), final=True) class LatexIncrementalEncoderTest(TestCase): """Encoder test fixture.""" errors = 'strict' IncrementalEncoder = LatexIncrementalEncoder def setUp(self): self.encoder = self.IncrementalEncoder(self.errors) def encode(self, chars: str, latex_bytes: bytes, final=False): result = self.encoder.encode(chars, final=final) self.assertEqual(result, latex_bytes) def tearDown(self): del self.encoder def test_invalid_type(self): with pytest.raises(TypeError): self.encoder.encode(object(), final=True) # type: ignore def test_invalid_code(self): with pytest.raises(ValueError): # default encoding is ascii, \u00ff is not ascii translatable self.encoder.encode(u"\u00ff", final=True) def test_hello(self): self.encode(u'hello', b'hello', final=True) def test_unicode_tokens(self): self.assertEqual( list(self.encoder.get_unicode_tokens( u"ĄąĄ̊ą̊ĘęĮįǪǫǬǭŲųY̨y̨", final=True)), u"Ą|ą|Ą̊|ą̊|Ę|ę|Į|į|Ǫ|ǫ|Ǭ|ǭ|Ų|ų|Y̨|y̨".split(u"|")) def test_state(self): self.assertEqual( list(self.encoder.get_unicode_tokens( u"Ą", final=False)), []) state = self.encoder.getstate() self.encoder.reset() self.assertEqual( list(self.encoder.get_unicode_tokens( u"ABC", final=True)), [u"A", u"B", u"C"]) self.encoder.setstate(state) self.assertEqual( list(self.encoder.get_unicode_tokens( u"̊", final=True)), [u"Ą̊"]) class UnicodeLatexIncrementalEncoderTest(LatexIncrementalEncoderTest): def encode(self, chars: str, latex_bytes: bytes, final: bool = False): result = self.encoder.uencode(chars, final=final) self.assertEqual(result, latex_bytes.decode('ascii'))