././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1737987726.7860878 w3lib-2.3.1/0000755000175100001660000000000014745713217012153 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/LICENSE0000644000175100001660000000277314745713200013161 0ustar00runnerdockerCopyright (c) w3lib and Scrapy developers. All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. 3. Neither the name of Scrapy nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/MANIFEST.in0000644000175100001660000000040714745713200013702 0ustar00runnerdocker# Include tests into distribution recursive-include tests *.py *.txt # Include documentation source recursive-include docs Makefile make.bat conf.py *.rst # Miscellaneous assets include LICENSE include NEWS include README.rst include pytest.ini include tox.ini ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/NEWS0000644000175100001660000002706114745713200012650 0ustar00runnerdockerw3lib release notes =================== 2.3.1 (2025-01-27) ------------------ - Fix a merge error, no code changes. 2.3.0 (2025-01-27) ------------------ - Dropped Python 3.8 support (#232). - Removed the following functions, deprecated in 2.0.0: - ``w3lib.util.str_to_unicode`` - ``w3lib.util.to_native_str`` - ``w3lib.util.unicode_to_str`` (#235). - Added Python 3.13 support (#232). - Fixed running tests with newer point releases of Python 3.10 and 3.11 (#233). - Cleanup and CI improvements (#232, #234). 2.2.1 (2024-06-12) ------------------ - :func:`~w3lib.url.canonicalize_url` no longer applies lowercase to the userinfo URL component. (#229, #230) 2.2.0 (2024-06-05) ------------------ - Dropped Python 3.7 support (#214). - Added Python 3.12 and PyPy 3.10 support (#218). - Added the description to the package metadata (#227). - Improved type hints (#226). - Added ``.readthedocs.yml`` (#219). - Updated the intersphinx URLs (#224). - Added the ``pre-commit`` configuration, code reformatted with ``black`` (#220). - Updated CI configuration (#217, #227). 
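A minimal usage sketch of the URL helpers covered by the entries above, assuming the current ``w3lib.url`` API (the outputs shown in comments are indicative rather than guaranteed for every release):

.. code-block:: python

    from w3lib.url import canonicalize_url, safe_url_string

    # canonicalize_url() sorts query arguments and lowercases the host;
    # since 2.2.1 the userinfo component is no longer lowercased.
    canonicalize_url("http://www.Example.com/do?c=3&b=2&a=1")
    # e.g. "http://www.example.com/do?a=1&b=2&c=3"

    # safe_url_string() percent-encodes unsafe characters; non-ASCII path
    # characters are encoded as UTF-8 before percent-escaping.
    safe_url_string("https://example.com/price/£100")
    # e.g. "https://example.com/price/%C2%A3100"
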
2.1.2 (2023-08-03) ------------------ - Fix test failures on Python 3.11.4+ (#212, #213). - Fix an incorrect type hint (#211). - Add project URLs to setup.py (#215). 2.1.1 (2022-12-09) ------------------ - :func:`~w3lib.url.safe_url_string`, :func:`~w3lib.url.safe_download_url` and :func:`~w3lib.url.canonicalize_url` now strip whitespace and control characters urls according to the URL living standard. 2.1.0 (2022-11-28) ------------------ - Dropped Python 3.6 support, and made Python 3.11 support official. (#195, #200) - :func:`~w3lib.url.safe_url_string` now generates safer URLs. To make URLs safer for the `URL living standard`_: .. _URL living standard: https://url.spec.whatwg.org/ - ``;=`` are percent-encoded in the URL username. - ``;:=`` are percent-encoded in the URL password. - ``'`` is percent-encoded in the URL query if the URL scheme is `special `__. To make URLs safer for `RFC 2396`_ and `RFC 3986`_, ``|[]`` are percent-encoded in URL paths, queries, and fragments. .. _RFC 2396: https://www.ietf.org/rfc/rfc2396.txt .. _RFC 3986: https://www.ietf.org/rfc/rfc3986.txt (#80, #203) - :func:`~w3lib.encoding.html_to_unicode` now checks for the `byte order mark`_ before inspecting the ``Content-Type`` header when determining the content encoding, in line with the `URL living standard`_. (#189, #191) .. _byte order mark: https://en.wikipedia.org/wiki/Byte_order_mark - :func:`~w3lib.url.canonicalize_url` now strips spaces from the input URL, to be more in line with the `URL living standard`_. (#132, #136) - :func:`~w3lib.html.get_base_url` now ignores HTML comments. (#70, #77) - Fixed :func:`~w3lib.url.safe_url_string` re-encoding percent signs on the URL username and password even when they were being used as part of an escape sequence. (#187, #196) - Fixed :func:`~w3lib.http.basic_auth_header` using the wrong flavor of base64 encoding, which could prevent authentication in rare cases. (#181, #192) - Fixed :func:`~w3lib.html.replace_entities` raising :exc:`OverflowError` in some cases due to `a bug in CPython `__. (#199, #202) - Improved typing and fixed typing issues. (#190, #206) - Made CI and test improvements. (#197, #198) - Adopted a Code of Conduct. (#194) 2.0.1 (2022-08-11) ------------------ Minor documentation fix (release date is set in the changelog). 2.0.0 (2022-08-11) ------------------ Backwards incompatible changes: - Python 2 is no longer supported; Python 3.6+ is required now (#168, #175). - :func:`w3lib.url.safe_url_string` and :func:`w3lib.url.canonicalize_url` no longer convert "%23" to "#" when it appears in the URL path. This is a bug fix. It's listed as a backward-incomatible change because in some cases the output of :func:`w3lib.url.canonicalize_url` is going to change, and so, if this output is used to generate URL fingerprints, new fingerprints might be incompatible with those created with the previous w3lib versions (#141). Deprecation removals (#169): - The ``w3lib.form`` module is removed. - The ``w3lib.html.remove_entities`` function is removed. - The ``w3lib.url.urljoin_rfc`` function is removed. The following functions are deprecated, and will be removed in future releases (#170): - ``w3lib.util.str_to_unicode`` - ``w3lib.util.unicode_to_str`` - ``w3lib.util.to_native_str`` Other improvements and bug fixes: - Type annotations are added (#172, #184). - Added support for Python 3.9 and 3.10 (#168, #176). - Fixed :func:`w3lib.html.get_meta_refresh` for ```` tags where ``http-equiv`` is written after ``content`` (#179). 
- Fixed :func:`w3lib.url.safe_url_string` for IDNA domains with ports (#174). - :func:`w3lib.url.url_query_cleaner` no longer adds an unneeded ``#`` when ``keep_fragments=True`` is passed, and the URL doesn't have a fragment (#159). - Removed a workaround for an ancient pathname2url bug (#142) - CI is migrated to GitHub Actions (#166, #177); other CI improvements (#160, #182). - The code is formatted using black (#173). 1.22.0 (2020-05-13) ------------------- - Python 3.4 is no longer supported (issue #156) - :func:`w3lib.url.safe_url_string` now supports an optional ``quote_path`` parameter to disable the percent-encoding of the URL path (issue #119) - :func:`w3lib.url.add_or_replace_parameter` and :func:`w3lib.url.add_or_replace_parameters` no longer remove duplicate parameters from the original query string that are not being added or replaced (issue #126) - :func:`w3lib.html.remove_tags` now raises a :exc:`ValueError` exception instead of :exc:`AssertionError` when using both the ``which_ones`` and the ``keep`` parameters (issue #154) - Test improvements (issues #143, #146, #148, #149) - Documentation improvements (issues #140, #144, #145, #151, #152, #153) - Code cleanup (issue #139) 1.21.0 (2019-08-09) ------------------- - Add the ``encoding`` and ``path_encoding`` parameters to :func:`w3lib.url.safe_download_url` (issue #118) - :func:`w3lib.url.safe_url_string` now also removes tabs and new lines (issue #133) - :func:`w3lib.html.remove_comments` now also removes truncated comments (issue #129) - :func:`w3lib.html.remove_tags_with_content` no longer removes tags which start with the same text as one of the specified tags (issue #114) - Recommend pytest instead of nose to run tests (issue #124) 1.20.0 (2019-01-11) ------------------- - Fix url_query_cleaner to do not append "?" to urls without a query string (issue #109) - Add support for Python 3.7 and drop Python 3.3 (issue #113) - Add `w3lib.url.add_or_replace_parameters` helper (issue #117) - Documentation fixes (issue #115) 1.19.0 (2018-01-25) ------------------- - Add a workaround for CPython segfault (https://bugs.python.org/issue32583) which affect w3lib.encoding functions. This is technically **backwards incompatible** because it changes the way non-decodable bytes are replaced (in some cases instead of two ``\ufffd`` chars you can get one). As a side effect, the fix speeds up decoding in Python 3.4+. - Add 'encoding' parameter for w3lib.http.basic_auth_header. - Fix pypy testing setup, add pypy3 to CI. 1.18.0 (2017-08-03) ------------------- - Include additional assets used for distribution packages in the source tarball - Consider ``[`` and ``]`` as safe characters in path and query components of URLs, i.e. they are not escaped anymore - Disable codecov project coverage check 1.17.0 (2017-02-08) ------------------- - Add Python 3.5 and 3.6 support - Add ``w3lib.url.parse_data_uri`` helper for parsing "data:" URIs - Add ``w3lib.html.strip_html5_whitespace`` function to strip leading and trailing whitespace as per W3C recommendations, e.g. 
for cleaning "href" attribute values - Fix ``w3lib.http.headers_raw_to_dict`` for multiple headers with same name - Do not distribute tests/test_*.pyc artifacts 1.16.0 (2016-11-10) ------------------- - ``canonicalize_url()`` and ``safe_url_string()``: strip ":" when no port is specified (as per `RFC 3986`_; see also https://github.com/scrapy/scrapy/issues/2377) - ``url_query_cleaner()``: support new ``keep_fragments`` argument (defaulting to ``False``) 1.15.0 (2016-07-29) ------------------- - Add ``canonicalize_url()`` to ``w3lib.url`` 1.14.3 (2016-07-14) ------------------- Bugfix release: - Handle IDNA encoding failures in ``safe_url_string()`` (issue #62) 1.14.2 (2016-04-11) ------------------- Bugfix release: - fix function import for (deprecated) ``urljoin_rfc`` (issue #51) - only expose wanted functions from ``w3lib.url``, via ``__all__`` (see issue #54, https://github.com/scrapy/scrapy/issues/1917) 1.14.1 (2016-04-07) ------------------- Bugfix release: - For bytes URLs, when supplied encoding (or default UTF8) is wrong, ``safe_url_string`` falls back to percent-encoding offending bytes. 1.14.0 (2016-04-06) ------------------- Changes to safe_url_string: - proper handling of non-ASCII characters in Python2 and Python3 - support IDNs - new `path_encoding` to override default UTF-8 when serializing non-ASCII characters before percent-encoding html_body_declared_encoding also detects encoding when not sole attribute in ````. Package is now properly marked as ``zip_safe``. 1.13.0 (2015-11-05) ------------------- - remove_tags removes uppercase tags as well; - ignore meta-redirects inside script or noscript tags by default, but add an option to not ignore them; - replace_entities now handles entities without trailing semicolon; - fixed uncaught UnicodeDecodeError when decoding entities. 1.12.0 (2015-06-29) ------------------- - meta_refresh regex now handles leading newlines and whitespaces in the url; - include tests folder in source distribution. 1.11.0 (2015-01-13) ------------------- - url_query_cleaner now supports str or list parameters; - add support for resolving base URLs in tags with attributes before href. 1.10.0 (2014-08-20) ------------------- - reverted all 1.9.0 changes. 1.9.0 (2014-08-16) ------------------ - all url-related functions accept bytes and unicode and now return bytes. 1.8.1 (2014-08-14) ------------------ - w3lib.http.basic_auth_header now returns bytes 1.8.0 (2014-07-31) ------------------ - add support for big5-hkscs encoding. 1.7.1 (2014-07-26) ------------------ - PY3 fixed headers_raw_to_dict and headers_dict_to_raw; - documentation improvements; - provide wheels. 1.6 (2014-06-03) ---------------- - w3lib.form.encode_multipart is deprecated; - docstrings and docs are improved; - w3lib.url.add_or_replace_parameter is re-implemented on top of stdlib functions; - remove_entities is renamed to replace_entities. 1.5 (2013-11-09) ---------------- - Python 2.6 support is dropped. 1.4 (2013-10-18) ---------------- - Python 3 support; - get_meta_refresh encoding handling is fixed; - check for '?' in add_or_replace_parameter; - ISO-8859-1 is used for HTTP Basic Auth; - fixed unicode handling in replace_escape_chars; 1.3 (2012-05-13) ---------------- - support non-standard gb_2312_80 encoding; - drop Python 2.5 support. 1.2 (2012-05-02) ---------------- - Detect encoding for content attr before http-equiv in meta tag. 
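A minimal sketch of two helpers referenced in the 1.4 and 1.6 entries above, assuming the current API; the Basic Auth value mirrors the one asserted in ``tests/test_http.py``:

.. code-block:: python

    from w3lib.http import basic_auth_header
    from w3lib.url import add_or_replace_parameter

    # Credentials are joined with ":" and base64-encoded; without an explicit
    # encoding argument, ISO-8859-1 is used (see the 1.4 entry above).
    assert basic_auth_header("someuser", "somepass") == b"Basic c29tZXVzZXI6c29tZXBhc3M="

    # Adds the parameter when missing, or replaces its value when present.
    add_or_replace_parameter("http://example.com/?version=1", "version", "2")
    # e.g. "http://example.com/?version=2"
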
1.1 (2012-04-18) ---------------- - w3lib.html.remove_comments handles multiline comments; - Added w3lib.encoding module, containing functions for working with character encoding, like encoding autodetection from HTML pages. - w3lib.url.urljoin_rfc is deprecated. 1.0 (2011-04-17) ---------------- First release of w3lib. ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1737987726.7860878 w3lib-2.3.1/PKG-INFO0000644000175100001660000000441414745713217013253 0ustar00runnerdockerMetadata-Version: 2.2 Name: w3lib Version: 2.3.1 Summary: Library of web-related functions Home-page: https://github.com/scrapy/w3lib Author: Scrapy project Author-email: info@scrapy.org License: BSD Project-URL: Documentation, https://w3lib.readthedocs.io/en/latest/ Project-URL: Source Code, https://github.com/scrapy/w3lib Project-URL: Issue Tracker, https://github.com/scrapy/w3lib/issues Platform: Any Classifier: Development Status :: 5 - Production/Stable Classifier: License :: OSI Approved :: BSD License Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: Python :: 3.11 Classifier: Programming Language :: Python :: 3.12 Classifier: Programming Language :: Python :: 3.13 Classifier: Programming Language :: Python :: Implementation :: CPython Classifier: Programming Language :: Python :: Implementation :: PyPy Classifier: Topic :: Internet :: WWW/HTTP Requires-Python: >=3.9 Description-Content-Type: text/x-rst License-File: LICENSE Dynamic: author Dynamic: author-email Dynamic: classifier Dynamic: description Dynamic: description-content-type Dynamic: home-page Dynamic: license Dynamic: platform Dynamic: project-url Dynamic: requires-python Dynamic: summary ===== w3lib ===== .. image:: https://github.com/scrapy/w3lib/actions/workflows/tests.yml/badge.svg :target: https://github.com/scrapy/w3lib/actions .. image:: https://img.shields.io/codecov/c/github/scrapy/w3lib/master.svg :target: http://codecov.io/github/scrapy/w3lib?branch=master :alt: Coverage report Overview ======== This is a Python library of web-related functions, such as: * remove comments, or tags from HTML snippets * extract base url from HTML snippets * translate entites on HTML strings * convert raw HTTP headers to dicts and vice-versa * construct HTTP auth header * converting HTML pages to unicode * sanitize urls (like browsers do) * extract arguments from urls Requirements ============ Python 3.9+ Install ======= ``pip install w3lib`` Documentation ============= See http://w3lib.readthedocs.org/ License ======= The w3lib library is licensed under the BSD license. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/README.rst0000644000175100001660000000162214745713200013633 0ustar00runnerdocker===== w3lib ===== .. image:: https://github.com/scrapy/w3lib/actions/workflows/tests.yml/badge.svg :target: https://github.com/scrapy/w3lib/actions .. 
image:: https://img.shields.io/codecov/c/github/scrapy/w3lib/master.svg :target: http://codecov.io/github/scrapy/w3lib?branch=master :alt: Coverage report Overview ======== This is a Python library of web-related functions, such as: * remove comments, or tags from HTML snippets * extract base url from HTML snippets * translate entites on HTML strings * convert raw HTTP headers to dicts and vice-versa * construct HTTP auth header * converting HTML pages to unicode * sanitize urls (like browsers do) * extract arguments from urls Requirements ============ Python 3.9+ Install ======= ``pip install w3lib`` Documentation ============= See http://w3lib.readthedocs.org/ License ======= The w3lib library is licensed under the BSD license. ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1737987726.7810876 w3lib-2.3.1/docs/0000755000175100001660000000000014745713217013103 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/docs/Makefile0000644000175100001660000001267014745713200014541 0ustar00runnerdocker# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" clean: -rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." 
htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/w3lib.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/w3lib.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/w3lib" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/w3lib" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/docs/conf.py0000644000175100001660000002054314745713200014376 0ustar00runnerdocker# w3lib documentation build configuration file, created by # sphinx-quickstart on Sun Jan 26 22:19:38 2014. # # This file is execfile()d with the current directory set to its containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. 
import os import sys # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. sys.path.insert(0, os.path.abspath("..")) # -- General configuration ----------------------------------------------------- # If your documentation needs a minimal Sphinx version, state it here. # needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be extensions # coming with Sphinx (named 'sphinx.ext.*') or your custom ones. extensions = [ "hoverxref.extension", "notfound.extension", "sphinx.ext.autodoc", "sphinx.ext.doctest", "sphinx.ext.intersphinx", "sphinx.ext.viewcode", ] # Add any paths that contain templates here, relative to this directory. templates_path = ["_templates"] # The suffix of source filenames. source_suffix = {".rst": "restructuredtext"} # The encoding of source files. # source_encoding = 'utf-8-sig' # The master toctree document. master_doc = "index" # General information about the project. project = "w3lib" copyright = "2014, w3lib developers" # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The full version, including alpha/beta/rc tags. release = "2.3.1" # The short X.Y version. version = ".".join(release.split(".")[:2]) # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: # today = '' # Else, today_fmt is used as the format for a strftime call. # today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = ["_build"] # The reST default role (used for this markup: `text`) to use for all documents. # default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. # add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). # add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. # show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = "sphinx" # A list of ignored prefixes for module index sorting. # modindex_common_prefix = [] # -- Options for HTML output --------------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. html_theme = "sphinx_rtd_theme" # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. # html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. # html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". # html_title = None # A shorter title for the navigation bar. Default is the same as html_title. # html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. 
# html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. # html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = [] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. # html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. # html_use_smartypants = True # Custom sidebar templates, maps document names to template names. # html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. # html_additional_pages = {} # If false, no module index is generated. # html_domain_indices = True # If false, no index is generated. # html_use_index = True # If true, the index is split into individual pages for each letter. # html_split_index = False # If true, links to the reST sources are added to the pages. # html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. # html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. # html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. # html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). # html_file_suffix = None # Output file base name for HTML help builder. htmlhelp_basename = "w3libdoc" # -- Options for LaTeX output -------------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). # 'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). # 'pointsize': '10pt', # Additional stuff for the LaTeX preamble. # 'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, author, documentclass [howto/manual]). latex_documents = [ ("index", "w3lib.tex", "w3lib Documentation", "w3lib developers", "manual"), ] # The name of an image file (relative to this directory) to place at the top of # the title page. # latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. # latex_use_parts = False # If true, show page references after internal links. # latex_show_pagerefs = False # If true, show URL addresses after external links. # latex_show_urls = False # Documents to append as an appendix to all manuals. # latex_appendices = [] # If false, no module index is generated. # latex_domain_indices = True # -- Options for manual page output -------------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [("index", "w3lib", "w3lib Documentation", ["w3lib developers"], 1)] # If true, show URL addresses after external links. # man_show_urls = False # -- Options for Texinfo output ------------------------------------------------ # Grouping the document tree into Texinfo files. 
List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ( "index", "w3lib", "w3lib Documentation", "w3lib developers", "w3lib", "One line description of project.", "Miscellaneous", ), ] # Documents to append as an appendix to all manuals. # texinfo_appendices = [] # If false, no module index is generated. # texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. # texinfo_show_urls = 'footnote' # Example configuration for intersphinx: refer to the Python standard library. intersphinx_mapping = { "pytest": ("https://docs.pytest.org/en/latest", None), "python": ("https://docs.python.org/3", None), "scrapy": ("https://docs.scrapy.org/en/latest", None), "tox": ("https://tox.wiki/en/latest", None), } # -- Nitpicking options ------------------------------------------------------- nitpicky = True # -- sphinx-hoverxref options ------------------------------------------------- hoverxref_auto_ref = True ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/docs/index.rst0000644000175100001660000000267014745713200014741 0ustar00runnerdockerWelcome to w3lib's documentation! ================================= Overview ======== This is a Python library of web-related functions, such as: * remove comments, or tags from HTML snippets * extract base url from HTML snippets * translate entities on HTML strings * convert raw HTTP headers to dicts and vice-versa * construct HTTP auth header * converting HTML pages to unicode * sanitize urls (like browsers do) * extract arguments from urls The w3lib library is licensed under the BSD license. Modules ======= .. toctree:: :maxdepth: 4 w3lib Requirements ============ Python 3.9+ Install ======= ``pip install w3lib`` Tests ===== :doc:`pytest ` is the preferred way to run tests. Just run: ``pytest`` from the root directory to execute tests using the default Python interpreter. :doc:`tox ` could be used to run tests for all supported Python versions. Install it (using 'pip install tox') and then run ``tox`` from the root directory - tests will be executed for all available Python interpreters. Changelog ========= .. include:: ../NEWS :start-line: 3 History ------- The code of w3lib was originally part of the :doc:`Scrapy framework ` but was later stripped out of Scrapy, with the aim of make it more reusable and to provide a useful library of web functions without depending on Scrapy. Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/docs/make.bat0000644000175100001660000001174614745713200014511 0ustar00runnerdocker@ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=_build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . set I18NSPHINXOPTS=%SPHINXOPTS% . if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^` where ^ is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. 
qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo. linkcheck to check all external links for integrity echo. doctest to run all doctests embedded in the documentation if enabled goto end ) if "%1" == "clean" ( for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i del /q /s %BUILDDIR%\* goto end ) if "%1" == "html" ( %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/html. goto end ) if "%1" == "dirhtml" ( %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. goto end ) if "%1" == "singlehtml" ( %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. goto end ) if "%1" == "pickle" ( %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the pickle files. goto end ) if "%1" == "json" ( %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the JSON files. goto end ) if "%1" == "htmlhelp" ( %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run HTML Help Workshop with the ^ .hhp project file in %BUILDDIR%/htmlhelp. goto end ) if "%1" == "qthelp" ( %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run "qcollectiongenerator" with the ^ .qhcp project file in %BUILDDIR%/qthelp, like this: echo.^> qcollectiongenerator %BUILDDIR%\qthelp\w3lib.qhcp echo.To view the help file: echo.^> assistant -collectionFile %BUILDDIR%\qthelp\w3lib.ghc goto end ) if "%1" == "devhelp" ( %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp if errorlevel 1 exit /b 1 echo. echo.Build finished. goto end ) if "%1" == "epub" ( %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub if errorlevel 1 exit /b 1 echo. echo.Build finished. The epub file is in %BUILDDIR%/epub. goto end ) if "%1" == "latex" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex if errorlevel 1 exit /b 1 echo. echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. goto end ) if "%1" == "text" ( %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text if errorlevel 1 exit /b 1 echo. echo.Build finished. The text files are in %BUILDDIR%/text. goto end ) if "%1" == "man" ( %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man if errorlevel 1 exit /b 1 echo. echo.Build finished. The manual pages are in %BUILDDIR%/man. goto end ) if "%1" == "texinfo" ( %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo if errorlevel 1 exit /b 1 echo. echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. goto end ) if "%1" == "gettext" ( %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale if errorlevel 1 exit /b 1 echo. echo.Build finished. The message catalogs are in %BUILDDIR%/locale. 
goto end ) if "%1" == "changes" ( %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes if errorlevel 1 exit /b 1 echo. echo.The overview file is in %BUILDDIR%/changes. goto end ) if "%1" == "linkcheck" ( %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck if errorlevel 1 exit /b 1 echo. echo.Link check complete; look for any errors in the above output ^ or in %BUILDDIR%/linkcheck/output.txt. goto end ) if "%1" == "doctest" ( %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest if errorlevel 1 exit /b 1 echo. echo.Testing of doctests in the sources finished, look at the ^ results in %BUILDDIR%/doctest/output.txt. goto end ) :end ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/docs/w3lib.rst0000644000175100001660000000073314745713200014650 0ustar00runnerdockerw3lib Package ============= :mod:`~w3lib.encoding` Module ----------------------------- .. automodule:: w3lib.encoding :members: :mod:`~w3lib.html` Module ------------------------- .. automodule:: w3lib.html :members: :mod:`~w3lib.http` Module ------------------------- .. automodule:: w3lib.http :members: :mod:`~w3lib.url` Module ------------------------ .. automodule:: w3lib.url :members: .. autoclass:: ParseDataURIResult :members: ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/pytest.ini0000644000175100001660000000007114745713200014172 0ustar00runnerdocker[pytest] doctest_optionflags = ALLOW_UNICODE ALLOW_BYTES ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1737987726.7860878 w3lib-2.3.1/setup.cfg0000644000175100001660000000004614745713217013774 0ustar00runnerdocker[egg_info] tag_build = tag_date = 0 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/setup.py0000644000175100001660000000304314745713200013655 0ustar00runnerdockerfrom setuptools import find_packages, setup with open("README.rst", encoding="utf-8") as f: long_description = f.read() setup( name="w3lib", version="2.3.1", license="BSD", description="Library of web-related functions", long_description=long_description, long_description_content_type="text/x-rst", author="Scrapy project", author_email="info@scrapy.org", url="https://github.com/scrapy/w3lib", project_urls={ "Documentation": "https://w3lib.readthedocs.io/en/latest/", "Source Code": "https://github.com/scrapy/w3lib", "Issue Tracker": "https://github.com/scrapy/w3lib/issues", }, packages=find_packages(exclude=("tests", "tests.*")), package_data={ "w3lib": ["py.typed"], }, include_package_data=True, zip_safe=False, platforms=["Any"], python_requires=">=3.9", classifiers=[ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.9", "Programming Language :: Python :: 3.10", "Programming Language :: Python :: 3.11", "Programming Language :: Python :: 3.12", "Programming Language :: Python :: 3.13", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Internet :: WWW/HTTP", ], ) ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1737987726.7830877 w3lib-2.3.1/tests/0000755000175100001660000000000014745713217013315 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 
mtime=1737987712.0 w3lib-2.3.1/tests/__init__.py0000644000175100001660000000000014745713200015404 0ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/tests/test_encoding.py0000644000175100001660000003051414745713200016507 0ustar00runnerdockerfrom __future__ import annotations import codecs import unittest from typing import Any from w3lib.encoding import ( html_body_declared_encoding, html_to_unicode, http_content_type_encoding, read_bom, resolve_encoding, to_unicode, ) class RequestEncodingTests(unittest.TestCase): utf8_fragments = [ # Content-Type as meta http-equiv b"""""", b"""\n""", b"""""", b"""""", b"""""", b"""""", b""" bad html still supported < meta http-equiv='Content-Type'\n content="text/html; charset=utf-8">""", # html5 meta charset b"""""", b"""""", # xml encoding b"""""", ] def test_bom(self): # cjk water character in unicode water_unicode = "\u6C34" # BOM + water character encoded utf16be = b"\xfe\xff\x6c\x34" utf16le = b"\xff\xfe\x34\x6c" utf32be = b"\x00\x00\xfe\xff\x00\x00\x6c\x34" utf32le = b"\xff\xfe\x00\x00\x34\x6c\x00\x00" for string in (utf16be, utf16le, utf32be, utf32le): bom_encoding, bom = read_bom(string) assert bom_encoding is not None assert bom is not None decoded = string[len(bom) :].decode(bom_encoding) self.assertEqual(water_unicode, decoded) # Body without BOM enc, bom = read_bom(b"foo") self.assertEqual(enc, None) self.assertEqual(bom, None) # Empty body enc, bom = read_bom(b"") self.assertEqual(enc, None) self.assertEqual(bom, None) def test_http_encoding_header(self): header_value = "Content-Type: text/html; charset=ISO-8859-4" extracted = http_content_type_encoding(header_value) self.assertEqual(extracted, "iso8859-4") self.assertEqual(None, http_content_type_encoding("something else")) def test_html_body_declared_encoding(self): for fragment in self.utf8_fragments: encoding = html_body_declared_encoding(fragment) self.assertEqual(encoding, "utf-8", fragment) self.assertEqual(None, html_body_declared_encoding(b"something else")) self.assertEqual( None, html_body_declared_encoding( b""" this isn't searched """ ), ) self.assertEqual( None, html_body_declared_encoding( b"""""" ), ) def test_html_body_declared_encoding_unicode(self): # html_body_declared_encoding should work when unicode body is passed self.assertEqual(None, html_body_declared_encoding("something else")) for fragment in self.utf8_fragments: encoding = html_body_declared_encoding(fragment.decode("utf8")) self.assertEqual(encoding, "utf-8", fragment) self.assertEqual( None, html_body_declared_encoding( """ this isn't searched """ ), ) self.assertEqual( None, html_body_declared_encoding( """""" ), ) class CodecsEncodingTestCase(unittest.TestCase): def test_resolve_encoding(self): self.assertEqual(resolve_encoding("latin1"), "cp1252") self.assertEqual(resolve_encoding(" Latin-1"), "cp1252") self.assertEqual(resolve_encoding("gb_2312-80"), "gb18030") self.assertEqual(resolve_encoding("unknown encoding"), None) class UnicodeDecodingTestCase(unittest.TestCase): def test_utf8(self): self.assertEqual(to_unicode(b"\xc2\xa3", "utf-8"), "\xa3") def test_invalid_utf8(self): self.assertEqual(to_unicode(b"\xc2\xc2\xa3", "utf-8"), "\ufffd\xa3") def ct(charset: str | None) -> str | None: return "Content-Type: text/html; charset=" + charset if charset else None def norm_encoding(enc: str) -> str: return codecs.lookup(enc).name class HtmlConversionTests(unittest.TestCase): def test_unicode_body(self): unicode_string = 
"\u043a\u0438\u0440\u0438\u043b\u043b\u0438\u0447\u0435\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442" original_string = unicode_string.encode("cp1251") encoding, body_unicode = html_to_unicode(ct("cp1251"), original_string) # check body_as_unicode self.assertTrue(isinstance(body_unicode, str)) self.assertEqual(body_unicode, unicode_string) def _assert_encoding( self, content_type: str | None, body: bytes, expected_encoding: str, expected_unicode: str | list[str], ) -> None: assert not isinstance(body, str) encoding, body_unicode = html_to_unicode(ct(content_type), body) self.assertTrue(isinstance(body_unicode, str)) self.assertEqual(norm_encoding(encoding), norm_encoding(expected_encoding)) if isinstance(expected_unicode, str): self.assertEqual(body_unicode, expected_unicode) else: self.assertTrue( body_unicode in expected_unicode, f"{body_unicode} is not in {expected_unicode}", ) def test_content_type_and_conversion(self): """Test content type header is interpreted and text converted as expected """ self._assert_encoding("utf-8", b"\xc2\xa3", "utf-8", "\xa3") # something like this in the scrapy tests - but that's invalid? # self._assert_encoding('', "\xa3", 'utf-8', "\xa3") # iso-8859-1 is overridden to cp1252 self._assert_encoding("iso-8859-1", b"\xa3", "cp1252", "\xa3") self._assert_encoding("", b"\xc2\xa3", "utf-8", "\xa3") self._assert_encoding("none", b"\xc2\xa3", "utf-8", "\xa3") self._assert_encoding("gb2312", b"\xa8D", "gb18030", "\u2015") self._assert_encoding("gbk", b"\xa8D", "gb18030", "\u2015") self._assert_encoding("big5", b"\xf9\xda", "big5hkscs", "\u6052") def test_invalid_utf8_encoded_body_with_valid_utf8_BOM(self): # unlike scrapy, the BOM is stripped self._assert_encoding( "utf-8", b"\xef\xbb\xbfWORD\xe3\xabWORD2", "utf-8", "WORD\ufffdWORD2" ) self._assert_encoding( None, b"\xef\xbb\xbfWORD\xe3\xabWORD2", "utf-8", "WORD\ufffdWORD2" ) def test_utf8_unexpected_end_of_data_with_valid_utf8_BOM(self): # Python implementations handle unexpected end of UTF8 data # differently (see https://bugs.pypy.org/issue1536). # It is hard to fix this for PyPy in w3lib, so the test # is permissive. 
# unlike scrapy, the BOM is stripped self._assert_encoding( "utf-8", b"\xef\xbb\xbfWORD\xe3\xab", "utf-8", ["WORD\ufffd\ufffd", "WORD\ufffd"], ) self._assert_encoding( None, b"\xef\xbb\xbfWORD\xe3\xab", "utf-8", ["WORD\ufffd\ufffd", "WORD\ufffd"], ) def test_replace_wrong_encoding(self): """Test invalid chars are replaced properly""" encoding, body_unicode = html_to_unicode(ct("utf-8"), b"PREFIX\xe3\xabSUFFIX") # XXX: Policy for replacing invalid chars may suffer minor variations # but it should always contain the unicode replacement char ('\ufffd') assert "\ufffd" in body_unicode, repr(body_unicode) assert "PREFIX" in body_unicode, repr(body_unicode) assert "SUFFIX" in body_unicode, repr(body_unicode) # Do not destroy html tags due to encoding bugs encoding, body_unicode = html_to_unicode(ct("utf-8"), b"\xf0value") assert "value" in body_unicode, repr(body_unicode) def _assert_encoding_detected( self, content_type: str | None, expected_encoding: str, body: bytes, **kwargs: Any, ) -> None: assert not isinstance(body, str) encoding, body_unicode = html_to_unicode(ct(content_type), body, **kwargs) self.assertTrue(isinstance(body_unicode, str)) self.assertEqual(norm_encoding(encoding), norm_encoding(expected_encoding)) def test_BOM(self): # utf-16 cases already tested, as is the BOM detection function # BOM takes precedence, ahead of the http header bom_be_str = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be") expected = "hi" self._assert_encoding("utf-8", bom_be_str, "utf-16-be", expected) # BOM is stripped when present bom_utf8_str = codecs.BOM_UTF8 + b"hi" self._assert_encoding("utf-8", bom_utf8_str, "utf-8", "hi") self._assert_encoding(None, bom_utf8_str, "utf-8", "hi") def test_utf16_32(self): # tools.ietf.org/html/rfc2781 section 4.3 # USE BOM and strip it bom_be_str = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be") self._assert_encoding("utf-16", bom_be_str, "utf-16-be", "hi") self._assert_encoding(None, bom_be_str, "utf-16-be", "hi") bom_le_str = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le") self._assert_encoding("utf-16", bom_le_str, "utf-16-le", "hi") self._assert_encoding(None, bom_le_str, "utf-16-le", "hi") bom_be_str = codecs.BOM_UTF32_BE + "hi".encode("utf-32-be") self._assert_encoding("utf-32", bom_be_str, "utf-32-be", "hi") self._assert_encoding(None, bom_be_str, "utf-32-be", "hi") bom_le_str = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le") self._assert_encoding("utf-32", bom_le_str, "utf-32-le", "hi") self._assert_encoding(None, bom_le_str, "utf-32-le", "hi") # if there is no BOM, big endian should be chosen self._assert_encoding("utf-16", "hi".encode("utf-16-be"), "utf-16-be", "hi") self._assert_encoding("utf-32", "hi".encode("utf-32-be"), "utf-32-be", "hi") def test_python_crash(self): import random from io import BytesIO random.seed(42) buf = BytesIO() for i in range(150000): buf.write(bytes([random.randint(0, 255)])) to_unicode(buf.getvalue(), "utf-16-le") to_unicode(buf.getvalue(), "utf-16-be") to_unicode(buf.getvalue(), "utf-32-le") to_unicode(buf.getvalue(), "utf-32-be") def test_html_encoding(self): # extracting the encoding from raw html is tested elsewhere body = b"""blah blah < meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> other stuff""" self._assert_encoding_detected(None, "cp1252", body) # header encoding takes precedence self._assert_encoding_detected("utf-8", "utf-8", body) # BOM encoding takes precedence self._assert_encoding_detected(None, "utf-8", codecs.BOM_UTF8 + body) def test_autodetect(self): def asciif(x): return "ascii" body = 
b"""""" # body encoding takes precedence self._assert_encoding_detected(None, "utf-8", body, auto_detect_fun=asciif) # if no other encoding, the auto detect encoding is used. self._assert_encoding_detected( None, "ascii", b"no encoding info", auto_detect_fun=asciif ) def test_default_encoding(self): # if no other method available, the default encoding of utf-8 is used self._assert_encoding_detected(None, "utf-8", b"no encoding info") # this can be overridden self._assert_encoding_detected( None, "ascii", b"no encoding info", default_encoding="ascii" ) def test_empty_body(self): # if no other method available, the default encoding of utf-8 is used self._assert_encoding_detected(None, "utf-8", b"") ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/tests/test_html.py0000644000175100001660000006057214745713200015674 0ustar00runnerdockerimport unittest from w3lib.html import ( get_base_url, get_meta_refresh, remove_comments, remove_tags, remove_tags_with_content, replace_entities, replace_escape_chars, replace_tags, unquote_markup, ) class RemoveEntitiesTest(unittest.TestCase): def test_returns_unicode(self): # make sure it always return uncode assert isinstance(replace_entities(b"no entities"), str) assert isinstance(replace_entities(b"Price: £100!"), str) assert isinstance(replace_entities("no entities"), str) assert isinstance(replace_entities("Price: £100!"), str) def test_regular(self): # regular conversions self.assertEqual(replace_entities("As low as £100!"), "As low as \xa3100!") self.assertEqual( replace_entities(b"As low as £100!"), "As low as \xa3100!" ) self.assertEqual( replace_entities( "redirectTo=search&searchtext=MR0221Y&aff=buyat&affsrc=d_data&cm_mmc=buyat-_-ELECTRICAL & SEASONAL-_-MR0221Y-_-9-carat gold ½oz solid crucifix pendant" ), "redirectTo=search&searchtext=MR0221Y&aff=buyat&affsrc=d_data&cm_mmc=buyat-_-ELECTRICAL & SEASONAL-_-MR0221Y-_-9-carat gold \xbdoz solid crucifix pendant", ) def test_keep_entities(self): # keep some entities self.assertEqual( replace_entities( b"Low < High & Medium £ six", keep=["lt", "amp"] ), "Low < High & Medium \xa3 six", ) self.assertEqual( replace_entities( "Low < High & Medium £ six", keep=["lt", "amp"] ), "Low < High & Medium \xa3 six", ) def test_illegal_entities(self): self.assertEqual( replace_entities( "a < b &illegal; c � six", remove_illegal=False ), "a < b &illegal; c � six", ) self.assertEqual( replace_entities( "a < b &illegal; c � six", remove_illegal=True ), "a < b c six", ) self.assertEqual(replace_entities("x≤y"), "x\u2264y") self.assertEqual(replace_entities("xy"), "xy") self.assertEqual(replace_entities("xy", remove_illegal=False), "xy") self.assertEqual(replace_entities("�"), "") self.assertEqual( replace_entities("�", remove_illegal=False), "�" ) def test_browser_hack(self): # check browser hack for numeric character references in the 80-9F range self.assertEqual(replace_entities("x™y", encoding="cp1252"), "x\u2122y") self.assertEqual(replace_entities("x™y", encoding="cp1252"), "x\u2122y") def test_missing_semicolon(self): for entity, result in ( ("<<!", "<some tag"), "This text contains some tag", ) self.assertEqual( replace_tags(b"This text is very important", " "), "This text is very im port ant", ) def test_replace_tags_multiline(self): self.assertEqual( replace_tags(b'Click here'), "Click here" ) class RemoveCommentsTest(unittest.TestCase): def test_returns_unicode(self): # make sure it always return unicode assert isinstance(remove_comments(b"without 
comments"), str) assert isinstance(remove_comments(b""), str) assert isinstance(remove_comments("without comments"), str) assert isinstance(remove_comments(""), str) def test_no_comments(self): # text without comments self.assertEqual( remove_comments("text without comments"), "text without comments" ) def test_remove_comments(self): # text with comments self.assertEqual(remove_comments(""), "") self.assertEqual(remove_comments("Hello"), "Hello") self.assertEqual(remove_comments("Hello"), "Hello") self.assertEqual( remove_comments(b"test whatever"), "test whatever" ) self.assertEqual( remove_comments(b"test whatever"), "test whatever" ) self.assertEqual(remove_comments(b"test """), "" ) self.assertEqual( get_base_url(""" """ ), "http://example_2.com/", ) self.assertEqual( get_base_url( """ """ ), "http://example_3.com/", ) def test_relative_url_with_absolute_path(self): baseurl = "https://example.org" text = """\ \ Dummy\ blahablsdfsal&\ """ self.assertEqual( get_base_url(text, baseurl), "https://example.org/absolutepath" ) def test_no_scheme_url(self): baseurl = "https://example.org" text = b"""\ \ Dummy\ blahablsdfsal&\ """ self.assertEqual(get_base_url(text, baseurl), "https://noscheme.com/path") def test_attributes_before_href(self): baseurl = "https://example.org" text = """\ \ Dummy\ blahablsdfsal&\ """ self.assertEqual(get_base_url(text, baseurl), "http://example.org/something") def test_tag_name(self): baseurl = "https://example.org" text = """\ \ Dummy\ blahablsdfsal&\ """ self.assertEqual(get_base_url(text, baseurl), "https://example.org") def test_get_base_url_utf8(self): baseurl = "https://example.org" text = """ Dummy blahablsdfsal& """ self.assertEqual( get_base_url(text, baseurl), "http://example.org/snowman%E2%8D%A8" ) def test_get_base_url_latin1(self): # page encoding does not affect URL path encoding before percent-escaping # we should still use UTF-8 by default baseurl = "https://example.org" text = """ Dummy blahablsdfsal& """ self.assertEqual( get_base_url(text, baseurl, encoding="latin-1"), "http://example.org/sterling%C2%A3", ) def test_get_base_url_latin1_percent(self): # non-UTF-8 percent-encoded characters sequence are left untouched baseurl = "https://example.org" text = """ Dummy blahablsdfsal& """ self.assertEqual(get_base_url(text, baseurl), "http://example.org/sterling%a3") class GetMetaRefreshTest(unittest.TestCase): def test_get_meta_refresh(self): baseurl = "http://example.org" body = """ Dummy blahablsdfsal& """ self.assertEqual( get_meta_refresh(body, baseurl), (5, "http://example.org/newpage") ) def test_without_url(self): # refresh without url should return (None, None) baseurl = "http://example.org" body = """""" self.assertEqual(get_meta_refresh(body, baseurl), (None, None)) body = """""" self.assertEqual( get_meta_refresh(body, baseurl), (5, "http://example.org/newpage") ) def test_multiline(self): # meta refresh in multiple lines baseurl = "http://example.org" body = """ """ self.assertEqual( get_meta_refresh(body, baseurl), (1, "http://example.org/newpage") ) def test_entities_in_redirect_url(self): # entities in the redirect url baseurl = "http://example.org" body = """""" self.assertEqual( get_meta_refresh(body, baseurl), (3, "http://www.example.com/other") ) def test_relative_redirects(self): # relative redirects baseurl = "http://example.com/page/this.html" body = """""" self.assertEqual( get_meta_refresh(body, baseurl), (3, "http://example.com/page/other.html") ) def test_nonascii_url_utf8(self): # non-ascii chars in the url (utf8 - default) 
baseurl = "http://example.com" body = b"""""" self.assertEqual( get_meta_refresh(body, baseurl), (3, "http://example.com/to%C2%A3") ) def test_nonascii_url_latin1(self): # non-ascii chars in the url path (latin1) # should end up UTF-8 encoded anyway baseurl = "http://example.com" body = b"""""" self.assertEqual( get_meta_refresh(body, baseurl, "latin1"), (3, "http://example.com/to%C2%A3"), ) def test_nonascii_url_latin1_query(self): # non-ascii chars in the url path and query (latin1) # only query part should be kept latin1 encoded before percent escaping baseurl = "http://example.com" body = b"""""" self.assertEqual( get_meta_refresh(body, baseurl, "latin1"), (3, "http://example.com/to%C2%A3?unit=%B5"), ) def test_commented_meta_refresh(self): # html commented meta refresh header must not directed baseurl = "http://example.com" body = """""" self.assertEqual(get_meta_refresh(body, baseurl), (None, None)) def test_html_comments_with_uncommented_meta_refresh(self): # html comments must not interfere with uncommented meta refresh header baseurl = "http://example.com" body = """-->""" self.assertEqual(get_meta_refresh(body, baseurl), (3, "http://example.com/")) def test_float_refresh_intervals(self): # float refresh intervals baseurl = "http://example.com" body = """""" self.assertEqual( get_meta_refresh(body, baseurl), (0.1, "http://example.com/index.html") ) body = """""" self.assertEqual( get_meta_refresh(body, baseurl), (3.1, "http://example.com/index.html") ) def test_tag_name(self): baseurl = "http://example.org" body = """ Dummy blahablsdfsal& """ self.assertEqual(get_meta_refresh(body, baseurl), (None, None)) def test_leading_newline_in_url(self): baseurl = "http://example.org" body = """ Dummy """ self.assertEqual( get_meta_refresh(body, baseurl), (0.0, "http://www.example.org/index.php") ) def test_inside_noscript(self): baseurl = "http://example.org" body = """ """ self.assertEqual(get_meta_refresh(body, baseurl), (None, None)) self.assertEqual( get_meta_refresh(body, baseurl, ignore_tags=()), (0.0, "http://example.org/javascript_required"), ) def test_inside_script(self): baseurl = "http://example.org" body = """ """ self.assertEqual(get_meta_refresh(body, baseurl), (None, None)) self.assertEqual( get_meta_refresh(body, baseurl, ignore_tags=()), (0.0, "http://example.org/foobar_required"), ) def test_redirections_in_different_ordering__in_meta_tag(self): baseurl = "http://localhost:8000" url1 = '' url2 = '' self.assertEqual( get_meta_refresh(url1, baseurl), (0.0, "http://localhost:8000/dummy.html") ) self.assertEqual( get_meta_refresh(url2, baseurl), (0.0, "http://localhost:8000/dummy.html") ) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/tests/test_http.py0000644000175100001660000000613314745713200015700 0ustar00runnerdockerimport unittest from collections import OrderedDict from w3lib.http import ( HeadersDictInput, basic_auth_header, headers_dict_to_raw, headers_raw_to_dict, ) __doctests__ = ["w3lib.http"] # for trial support class HttpTests(unittest.TestCase): def test_basic_auth_header(self): self.assertEqual( b"Basic c29tZXVzZXI6c29tZXBhc3M=", basic_auth_header("someuser", "somepass") ) # Check url unsafe encoded header self.assertEqual( b"Basic c29tZXVzZXI6QDx5dTk+Jm8/UQ==", basic_auth_header("someuser", "@&o?Q"), ) def test_basic_auth_header_encoding(self): self.assertEqual( b"Basic c29tw6Z1c8Oocjpzw7htZXDDpHNz", basic_auth_header("somæusèr", "sømepäss", encoding="utf8"), ) # default encoding (ISO-8859-1) 
self.assertEqual( b"Basic c29t5nVz6HI6c/htZXDkc3M=", basic_auth_header("somæusèr", "sømepäss") ) def test_headers_raw_dict_none(self): self.assertIsNone(headers_raw_to_dict(None)) self.assertIsNone(headers_dict_to_raw(None)) def test_headers_raw_to_dict(self): raw = b"Content-type: text/html\n\rAccept: gzip\n\r\ Cache-Control: no-cache\n\rCache-Control: no-store\n\n" dct = { b"Content-type": [b"text/html"], b"Accept": [b"gzip"], b"Cache-Control": [b"no-cache", b"no-store"], } self.assertEqual(headers_raw_to_dict(raw), dct) def test_headers_dict_to_raw(self): dct = OrderedDict([(b"Content-type", b"text/html"), (b"Accept", b"gzip")]) self.assertEqual( headers_dict_to_raw(dct), b"Content-type: text/html\r\nAccept: gzip" ) def test_headers_dict_to_raw_listtuple(self): dct: HeadersDictInput = OrderedDict( [(b"Content-type", [b"text/html"]), (b"Accept", [b"gzip"])] ) self.assertEqual( headers_dict_to_raw(dct), b"Content-type: text/html\r\nAccept: gzip" ) dct = OrderedDict([(b"Content-type", (b"text/html",)), (b"Accept", (b"gzip",))]) self.assertEqual( headers_dict_to_raw(dct), b"Content-type: text/html\r\nAccept: gzip" ) dct = OrderedDict([(b"Cookie", (b"val001", b"val002")), (b"Accept", b"gzip")]) self.assertEqual( headers_dict_to_raw(dct), b"Cookie: val001\r\nCookie: val002\r\nAccept: gzip", ) dct = OrderedDict([(b"Cookie", [b"val001", b"val002"]), (b"Accept", b"gzip")]) self.assertEqual( headers_dict_to_raw(dct), b"Cookie: val001\r\nCookie: val002\r\nAccept: gzip", ) def test_headers_dict_to_raw_wrong_values(self): dct: HeadersDictInput = OrderedDict( [ (b"Content-type", 0), ] ) self.assertEqual(headers_dict_to_raw(dct), b"") self.assertEqual(headers_dict_to_raw(dct), b"") dct = OrderedDict([(b"Content-type", 1), (b"Accept", [b"gzip"])]) self.assertEqual(headers_dict_to_raw(dct), b"Accept: gzip") ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/tests/test_url.py0000644000175100001660000017603214745713200015531 0ustar00runnerdockerfrom __future__ import annotations import os import sys import unittest from inspect import isclass from typing import Callable from urllib.parse import urlparse import pytest from w3lib._infra import ( _ASCII_ALPHA, _ASCII_ALPHANUMERIC, _ASCII_TAB_OR_NEWLINE, _C0_CONTROL_OR_SPACE, ) from w3lib._types import StrOrBytes from w3lib._url import _SPECIAL_SCHEMES from w3lib.url import ( add_or_replace_parameter, add_or_replace_parameters, any_to_uri, canonicalize_url, file_uri_to_path, is_url, parse_data_uri, parse_url, path_to_file_uri, safe_download_url, safe_url_string, url_query_cleaner, url_query_parameter, ) # Test cases for URL-to-safe-URL conversions with a URL and an encoding as # input parameters. # # (encoding, input URL, output URL or exception) SAFE_URL_ENCODING_CASES: list[tuple[str | None, StrOrBytes, str | type[Exception]]] = [ (None, "", ValueError), (None, "https://example.com", "https://example.com"), (None, "https://example.com/©", "https://example.com/%C2%A9"), # Paths are always UTF-8-encoded. ("iso-8859-1", "https://example.com/©", "https://example.com/%C2%A9"), # Queries are UTF-8-encoded if the scheme is not special, ws or wss. ("iso-8859-1", "a://example.com?©", "a://example.com?%C2%A9"), *( ("iso-8859-1", f"{scheme}://example.com?©", f"{scheme}://example.com?%C2%A9") for scheme in ("ws", "wss") ), *( ("iso-8859-1", f"{scheme}://example.com?©", f"{scheme}://example.com?%A9") for scheme in _SPECIAL_SCHEMES if scheme not in {"ws", "wss"} ), # Fragments are always UTF-8-encoded. 
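    # Illustrative note (not part of the test data): "©" is U+00A9; UTF-8
    # encodes it as 0xC2 0xA9 ("%C2%A9"), while ISO-8859-1 encodes it as the
    # single byte 0xA9 ("%A9"), which is why the special-scheme query cases
    # above differ from the path and fragment cases.
    #   >>> "©".encode("utf-8"), "©".encode("iso-8859-1")
    #   (b'\xc2\xa9', b'\xa9')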
("iso-8859-1", "https://example.com#©", "https://example.com#%C2%A9"), ] INVALID_SCHEME_FOLLOW_UPS = "".join( chr(value) for value in range(0x81) if ( chr(value) not in _ASCII_ALPHANUMERIC and chr(value) not in "+-." and chr(value) not in _C0_CONTROL_OR_SPACE # stripped and chr(value) != ":" # separator ) ) SAFE_URL_URL_INVALID_SCHEME_CASES = tuple( (f"{scheme}://example.com", ValueError) for scheme in ( # A scheme is required. "", # The first scheme letter must be an ASCII alpha. # Note: 0x80 is included below to also test non-ASCII example. *( chr(value) for value in range(0x81) if ( chr(value) not in _ASCII_ALPHA and chr(value) not in _C0_CONTROL_OR_SPACE # stripped and chr(value) != ":" # separator ) ), # The follow-up scheme letters can also be ASCII numbers, plus, hyphen, # or period. f"a{INVALID_SCHEME_FOLLOW_UPS}", ) ) SCHEME_NON_FIRST = _ASCII_ALPHANUMERIC + "+-." # Username and password characters that do not need escaping. # Removed for RFC 2396 and RFC 3986: % # Removed for the URL living standard: :;= USERINFO_SAFE = _ASCII_ALPHANUMERIC + "-_.!~*'()" + "&+$," USERNAME_TO_ENCODE = "".join( chr(value) for value in range(0x80) if ( chr(value) not in _C0_CONTROL_OR_SPACE and chr(value) not in USERINFO_SAFE and chr(value) not in ":/?#\\[]" ) ) USERNAME_ENCODED = "".join(f"%{ord(char):02X}" for char in USERNAME_TO_ENCODE) PASSWORD_TO_ENCODE = USERNAME_TO_ENCODE + ":" PASSWORD_ENCODED = "".join(f"%{ord(char):02X}" for char in PASSWORD_TO_ENCODE) # Path characters that do not need escaping. # Removed for RFC 2396 and RFC 3986: %[\]^| PATH_SAFE = _ASCII_ALPHANUMERIC + "-_.!~*'()" + ":@&=+$," + "/" + ";" PATH_TO_ENCODE = "".join( chr(value) for value in range(0x80) if ( chr(value) not in _C0_CONTROL_OR_SPACE and chr(value) not in PATH_SAFE and chr(value) not in "?#\\" ) ) PATH_ENCODED = "".join(f"%{ord(char):02X}" for char in PATH_TO_ENCODE) # Query characters that do not need escaping. # Removed for RFC 2396 and RFC 3986: %[\]^`{|} # Removed for the URL living standard: ' (special) QUERY_SAFE = _ASCII_ALPHANUMERIC + "-_.!~*'()" + ":@&=+$," + "/" + ";" + "?" QUERY_TO_ENCODE = "".join( chr(value) for value in range(0x80) if ( chr(value) not in _C0_CONTROL_OR_SPACE and chr(value) not in QUERY_SAFE and chr(value) not in "#" ) ) QUERY_ENCODED = "".join(f"%{ord(char):02X}" for char in QUERY_TO_ENCODE) SPECIAL_QUERY_SAFE = QUERY_SAFE.replace("'", "") SPECIAL_QUERY_TO_ENCODE = "".join( chr(value) for value in range(0x80) if ( chr(value) not in _C0_CONTROL_OR_SPACE and chr(value) not in SPECIAL_QUERY_SAFE and chr(value) not in "#" ) ) SPECIAL_QUERY_ENCODED = "".join(f"%{ord(char):02X}" for char in SPECIAL_QUERY_TO_ENCODE) # Fragment characters that do not need escaping. # Removed for RFC 2396 and RFC 3986: #%[\\]^{|} FRAGMENT_SAFE = _ASCII_ALPHANUMERIC + "-_.!~*'()" + ":@&=+$," + "/" + ";" + "?" FRAGMENT_TO_ENCODE = "".join( chr(value) for value in range(0x80) if (chr(value) not in _C0_CONTROL_OR_SPACE and chr(value) not in FRAGMENT_SAFE) ) FRAGMENT_ENCODED = "".join(f"%{ord(char):02X}" for char in FRAGMENT_TO_ENCODE) # Test cases for URL-to-safe-URL conversions with only a URL as input parameter # (i.e. no encoding or base URL). # # (input URL, output URL or exception) SAFE_URL_URL_CASES = ( # Invalid input type (1, Exception), (object(), Exception), # Empty string ("", ValueError), # Remove any leading and trailing C0 control or space from input. 
*( (f"{char}https://example.com{char}", "https://example.com") for char in _C0_CONTROL_OR_SPACE if char not in _ASCII_TAB_OR_NEWLINE ), # Remove all ASCII tab or newline from input. ( ( f"{_ASCII_TAB_OR_NEWLINE}h{_ASCII_TAB_OR_NEWLINE}ttps" f"{_ASCII_TAB_OR_NEWLINE}:{_ASCII_TAB_OR_NEWLINE}/" f"{_ASCII_TAB_OR_NEWLINE}/{_ASCII_TAB_OR_NEWLINE}a" f"{_ASCII_TAB_OR_NEWLINE}b{_ASCII_TAB_OR_NEWLINE}:" f"{_ASCII_TAB_OR_NEWLINE}a{_ASCII_TAB_OR_NEWLINE}b" f"{_ASCII_TAB_OR_NEWLINE}@{_ASCII_TAB_OR_NEWLINE}exam" f"{_ASCII_TAB_OR_NEWLINE}ple.com{_ASCII_TAB_OR_NEWLINE}:" f"{_ASCII_TAB_OR_NEWLINE}1{_ASCII_TAB_OR_NEWLINE}2" f"{_ASCII_TAB_OR_NEWLINE}/{_ASCII_TAB_OR_NEWLINE}a" f"{_ASCII_TAB_OR_NEWLINE}b{_ASCII_TAB_OR_NEWLINE}?" f"{_ASCII_TAB_OR_NEWLINE}a{_ASCII_TAB_OR_NEWLINE}b" f"{_ASCII_TAB_OR_NEWLINE}#{_ASCII_TAB_OR_NEWLINE}a" f"{_ASCII_TAB_OR_NEWLINE}b{_ASCII_TAB_OR_NEWLINE}" ), "https://ab:ab@example.com:12/ab?ab#ab", ), # Scheme (f"{_ASCII_ALPHA}://example.com", f"{_ASCII_ALPHA.lower()}://example.com"), ( f"a{SCHEME_NON_FIRST}://example.com", f"a{SCHEME_NON_FIRST.lower()}://example.com", ), *SAFE_URL_URL_INVALID_SCHEME_CASES, # Authority ("https://a@example.com", "https://a@example.com"), ("https://a:@example.com", "https://a:@example.com"), ("https://a:a@example.com", "https://a:a@example.com"), ("https://a%3A@example.com", "https://a%3A@example.com"), ( f"https://{USERINFO_SAFE}:{USERINFO_SAFE}@example.com", f"https://{USERINFO_SAFE}:{USERINFO_SAFE}@example.com", ), ( f"https://{USERNAME_TO_ENCODE}:{PASSWORD_TO_ENCODE}@example.com", f"https://{USERNAME_ENCODED}:{PASSWORD_ENCODED}@example.com", ), ("https://@\\example.com", ValueError), ("https://\x80:\x80@example.com", "https://%C2%80:%C2%80@example.com"), # Host ("https://example.com", "https://example.com"), ("https://.example", "https://.example"), ("https://\x80.example", ValueError), ("https://%80.example", ValueError), # The 4 cases below test before and after crossing DNS length limits on # domain name labels (63 characters) and the domain name as a whole (253 # characters). However, all cases are expected to pass because the URL # living standard does not require domain names to be within these limits. 
(f"https://{'a' * 63}.example", f"https://{'a' * 63}.example"), (f"https://{'a' * 64}.example", f"https://{'a' * 64}.example"), ( f"https://{'a' * 63}.{'a' * 63}.{'a' * 63}.{'a' * 53}.example", f"https://{'a' * 63}.{'a' * 63}.{'a' * 63}.{'a' * 53}.example", ), ( f"https://{'a' * 63}.{'a' * 63}.{'a' * 63}.{'a' * 54}.example", f"https://{'a' * 63}.{'a' * 63}.{'a' * 63}.{'a' * 54}.example", ), ("https://ñ.example", "https://xn--ida.example"), ("http://192.168.0.0", "http://192.168.0.0"), ("http://192.168.0.256", ValueError), ("http://192.168.0.0.0", ValueError), ("http://[2a01:5cc0:1:2::4]", "http://[2a01:5cc0:1:2::4]"), ("http://[2a01:5cc0:1:2:3:4]", ValueError), # Port ("https://example.com:", "https://example.com:"), ("https://example.com:1", "https://example.com:1"), ("https://example.com:443", "https://example.com:443"), # Path ("https://example.com/", "https://example.com/"), ("https://example.com/a", "https://example.com/a"), ("https://example.com\\a", "https://example.com/a"), ("https://example.com/a\\b", "https://example.com/a/b"), ( f"https://example.com/{PATH_SAFE}", f"https://example.com/{PATH_SAFE}", ), ( f"https://example.com/{PATH_TO_ENCODE}", f"https://example.com/{PATH_ENCODED}", ), ("https://example.com/ñ", "https://example.com/%C3%B1"), ("https://example.com/ñ%C3%B1", "https://example.com/%C3%B1%C3%B1"), # Query ("https://example.com?", "https://example.com?"), ("https://example.com/?", "https://example.com/?"), ("https://example.com?a", "https://example.com?a"), ("https://example.com?a=", "https://example.com?a="), ("https://example.com?a=b", "https://example.com?a=b"), ( f"a://example.com?{QUERY_SAFE}", f"a://example.com?{QUERY_SAFE}", ), ( f"a://example.com?{QUERY_TO_ENCODE}", f"a://example.com?{QUERY_ENCODED}", ), *( ( f"{scheme}://example.com?{SPECIAL_QUERY_SAFE}", f"{scheme}://example.com?{SPECIAL_QUERY_SAFE}", ) for scheme in _SPECIAL_SCHEMES ), *( ( f"{scheme}://example.com?{SPECIAL_QUERY_TO_ENCODE}", f"{scheme}://example.com?{SPECIAL_QUERY_ENCODED}", ) for scheme in _SPECIAL_SCHEMES ), ("https://example.com?ñ", "https://example.com?%C3%B1"), ("https://example.com?ñ%C3%B1", "https://example.com?%C3%B1%C3%B1"), # Fragment ("https://example.com#", "https://example.com#"), ("https://example.com/#", "https://example.com/#"), ("https://example.com?#", "https://example.com?#"), ("https://example.com/?#", "https://example.com/?#"), ("https://example.com#a", "https://example.com#a"), ( f"a://example.com#{FRAGMENT_SAFE}", f"a://example.com#{FRAGMENT_SAFE}", ), ( f"a://example.com#{FRAGMENT_TO_ENCODE}", f"a://example.com#{FRAGMENT_ENCODED}", ), ("https://example.com#ñ", "https://example.com#%C3%B1"), ("https://example.com#ñ%C3%B1", "https://example.com#%C3%B1%C3%B1"), # All fields, UTF-8 wherever possible. 
( "https://ñ:ñ@ñ.example:1/ñ?ñ#ñ", "https://%C3%B1:%C3%B1@xn--ida.example:1/%C3%B1?%C3%B1#%C3%B1", ), ) def _test_safe_url_func( url: StrOrBytes, *, encoding: str | None = None, output: str | type[Exception], func: Callable[..., str], ) -> None: kwargs = {} if encoding is not None: kwargs["encoding"] = encoding if isclass(output) and issubclass(output, Exception): with pytest.raises(output): func(url, **kwargs) return actual = func(url, **kwargs) assert actual == output assert func(actual, **kwargs) == output # Idempotency def _test_safe_url_string( url: StrOrBytes, *, encoding: str | None = None, output: str | type[Exception], ) -> None: return _test_safe_url_func( url, encoding=encoding, output=output, func=safe_url_string, ) KNOWN_SAFE_URL_STRING_ENCODING_ISSUES = { (None, ""), # Invalid URL # UTF-8 encoding is not enforced in non-special URLs, or in URLs with the # ws or wss schemas. ("iso-8859-1", "a://example.com?\xa9"), ("iso-8859-1", "ws://example.com?\xa9"), ("iso-8859-1", "wss://example.com?\xa9"), # UTF-8 encoding is not enforced on the fragment. ("iso-8859-1", "https://example.com#\xa9"), } @pytest.mark.parametrize( "encoding,url,output", tuple( ( case if case[:2] not in KNOWN_SAFE_URL_STRING_ENCODING_ISSUES else pytest.param(*case, marks=pytest.mark.xfail(strict=True)) ) for case in SAFE_URL_ENCODING_CASES ), ) def test_safe_url_string_encoding( encoding: str | None, url: StrOrBytes, output: str | type[Exception] ) -> None: _test_safe_url_string(url, encoding=encoding, output=output) KNOWN_SAFE_URL_STRING_URL_ISSUES = { "", # Invalid URL *(case[0] for case in SAFE_URL_URL_INVALID_SCHEME_CASES), # Userinfo characters that the URL living standard requires escaping (:;=) # are not escaped. "https://@\\example.com", # Invalid URL "https://\x80.example", # Invalid domain name (non-visible character) "https://%80.example", # Invalid domain name (non-visible character) "http://192.168.0.256", # Invalid IP address "http://192.168.0.0.0", # Invalid IP address / domain name "http://[2a01:5cc0:1:2::4]", # https://github.com/scrapy/w3lib/issues/193 "https://example.com:", # Removes the : # Does not convert \ to / "https://example.com\\a", "https://example.com\\a\\b", # Encodes \ and / after the first one in the path "https://example.com/a/b", "https://example.com/a\\b", # Some path characters that RFC 2396 and RFC 3986 require escaping (%) # are not escaped. f"https://example.com/{PATH_TO_ENCODE}", # ? is removed "https://example.com?", "https://example.com/?", # Some query characters that RFC 2396 and RFC 3986 require escaping (%) # are not escaped. f"a://example.com?{QUERY_TO_ENCODE}", # Some special query characters that RFC 2396 and RFC 3986 require escaping # (%) are not escaped. *( f"{scheme}://example.com?{SPECIAL_QUERY_TO_ENCODE}" for scheme in _SPECIAL_SCHEMES ), # ? and # are removed "https://example.com#", "https://example.com/#", "https://example.com?#", "https://example.com/?#", # Some fragment characters that RFC 2396 and RFC 3986 require escaping # (%) are not escaped. 
f"a://example.com#{FRAGMENT_TO_ENCODE}", } if ( sys.version_info < (3, 9, 21) or (sys.version_info[:2] == (3, 10) and sys.version_info < (3, 10, 16)) or (sys.version_info[:2] == (3, 11) and sys.version_info < (3, 11, 4)) ): KNOWN_SAFE_URL_STRING_URL_ISSUES.add("http://[2a01:5cc0:1:2:3:4]") # Invalid IPv6 @pytest.mark.parametrize( "url,output", tuple( ( case if case[0] not in KNOWN_SAFE_URL_STRING_URL_ISSUES else pytest.param(*case, marks=pytest.mark.xfail(strict=True)) ) for case in SAFE_URL_URL_CASES ), ) def test_safe_url_string_url(url: StrOrBytes, output: str | type[Exception]) -> None: _test_safe_url_string(url, output=output) class UrlTests(unittest.TestCase): def test_safe_url_string(self): # Motoko Kusanagi (Cyborg from Ghost in the Shell) motoko = "\u8349\u8599 \u7d20\u5b50" self.assertEqual( safe_url_string(motoko), # note the %20 for space "%E8%8D%89%E8%96%99%20%E7%B4%A0%E5%AD%90", ) self.assertEqual( safe_url_string(motoko), safe_url_string(safe_url_string(motoko)) ) self.assertEqual(safe_url_string("©"), "%C2%A9") # copyright symbol # page-encoding does not affect URL path self.assertEqual(safe_url_string("©", "iso-8859-1"), "%C2%A9") # path_encoding does self.assertEqual(safe_url_string("©", path_encoding="iso-8859-1"), "%A9") self.assertEqual( safe_url_string("http://www.example.org/"), "http://www.example.org/" ) alessi = "/ecommerce/oggetto/Te \xf2/tea-strainer/1273" self.assertEqual( safe_url_string(alessi), "/ecommerce/oggetto/Te%20%C3%B2/tea-strainer/1273" ) self.assertEqual( safe_url_string( "http://www.example.com/test?p(29)url(http://www.another.net/page)" ), "http://www.example.com/test?p(29)url(http://www.another.net/page)", ) self.assertEqual( safe_url_string( "http://www.example.com/Brochures_&_Paint_Cards&PageSize=200" ), "http://www.example.com/Brochures_&_Paint_Cards&PageSize=200", ) # page-encoding does not affect URL path # we still end up UTF-8 encoding characters before percent-escaping safeurl = safe_url_string("http://www.example.com/£") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3") safeurl = safe_url_string("http://www.example.com/£", encoding="utf-8") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3") safeurl = safe_url_string("http://www.example.com/£", encoding="latin-1") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3") safeurl = safe_url_string("http://www.example.com/£", path_encoding="latin-1") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%A3") self.assertTrue(isinstance(safe_url_string(b"http://example.com/"), str)) def test_safe_url_string_remove_ascii_tab_and_newlines(self): self.assertEqual( safe_url_string("http://example.com/test\n.html"), "http://example.com/test.html", ) self.assertEqual( safe_url_string("http://example.com/test\t.html"), "http://example.com/test.html", ) self.assertEqual( safe_url_string("http://example.com/test\r.html"), "http://example.com/test.html", ) self.assertEqual( safe_url_string("http://example.com/test\r.html\n"), "http://example.com/test.html", ) self.assertEqual( safe_url_string("http://example.com/test\r\n.html\t"), "http://example.com/test.html", ) self.assertEqual( safe_url_string("http://example.com/test\a\n.html"), "http://example.com/test%07.html", ) def test_safe_url_string_quote_path(self): safeurl = safe_url_string('http://google.com/"hello"', quote_path=True) self.assertEqual(safeurl, 
"http://google.com/%22hello%22") safeurl = safe_url_string('http://google.com/"hello"', quote_path=False) self.assertEqual(safeurl, 'http://google.com/"hello"') safeurl = safe_url_string('http://google.com/"hello"') self.assertEqual(safeurl, "http://google.com/%22hello%22") def test_safe_url_string_with_query(self): safeurl = safe_url_string("http://www.example.com/£?unit=µ") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3?unit=%C2%B5") safeurl = safe_url_string("http://www.example.com/£?unit=µ", encoding="utf-8") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3?unit=%C2%B5") safeurl = safe_url_string("http://www.example.com/£?unit=µ", encoding="latin-1") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3?unit=%B5") safeurl = safe_url_string( "http://www.example.com/£?unit=µ", path_encoding="latin-1" ) self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%A3?unit=%C2%B5") safeurl = safe_url_string( "http://www.example.com/£?unit=µ", encoding="latin-1", path_encoding="latin-1", ) self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%A3?unit=%B5") def test_safe_url_string_misc(self): # mixing Unicode and percent-escaped sequences safeurl = safe_url_string("http://www.example.com/£?unit=%C2%B5") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3?unit=%C2%B5") safeurl = safe_url_string("http://www.example.com/%C2%A3?unit=µ") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3?unit=%C2%B5") def test_safe_url_string_bytes_input(self): safeurl = safe_url_string(b"http://www.example.com/") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/") # bytes input is assumed to be UTF-8 safeurl = safe_url_string(b"http://www.example.com/\xc2\xb5") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%B5") # page-encoding encoded bytes still end up as UTF-8 sequences in path safeurl = safe_url_string(b"http://www.example.com/\xb5", encoding="latin1") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%B5") safeurl = safe_url_string( b"http://www.example.com/\xa3?unit=\xb5", encoding="latin1" ) self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%C2%A3?unit=%B5") def test_safe_url_string_bytes_input_nonutf8(self): # latin1 safeurl = safe_url_string(b"http://www.example.com/\xa3?unit=\xb5") self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/%A3?unit=%B5") # cp1251 # >>> 'Россия'.encode('cp1251') # '\xd0\xee\xf1\xf1\xe8\xff' safeurl = safe_url_string( b"http://www.example.com/country/\xd0\xee\xf1\xf1\xe8\xff" ) self.assertTrue(isinstance(safeurl, str)) self.assertEqual(safeurl, "http://www.example.com/country/%D0%EE%F1%F1%E8%FF") def test_safe_url_idna(self): # adapted from: # https://ssl.icu-project.org/icu-bin/idnbrowser # http://unicode.org/faq/idn.html # + various others websites = ( ( "http://www.färgbolaget.nu/färgbolaget", "http://www.xn--frgbolaget-q5a.nu/f%C3%A4rgbolaget", ), ( "http://www.räksmörgås.se/?räksmörgås=yes", "http://www.xn--rksmrgs-5wao1o.se/?r%C3%A4ksm%C3%B6rg%C3%A5s=yes", ), ( "http://www.brændendekærlighed.com/brændende/kærlighed", 
"http://www.xn--brndendekrlighed-vobh.com/br%C3%A6ndende/k%C3%A6rlighed", ), ("http://www.예비교사.com", "http://www.xn--9d0bm53a3xbzui.com"), ("http://理容ナカムラ.com", "http://xn--lck1c3crb1723bpq4a.com"), ("http://あーるいん.com", "http://xn--l8je6s7a45b.com"), # --- real websites --- # in practice, this redirect (301) to http://www.buecher.de/?q=b%C3%BCcher ( "http://www.bücher.de/?q=bücher", "http://www.xn--bcher-kva.de/?q=b%C3%BCcher", ), # Japanese ( "http://はじめよう.みんな/?query=サ&maxResults=5", "http://xn--p8j9a0d9c9a.xn--q9jyb4c/?query=%E3%82%B5&maxResults=5", ), # Russian ("http://кто.рф/", "http://xn--j1ail.xn--p1ai/"), ( "http://кто.рф/index.php?domain=Что", "http://xn--j1ail.xn--p1ai/index.php?domain=%D0%A7%D1%82%D0%BE", ), # Korean ("http://내도메인.한국/", "http://xn--220b31d95hq8o.xn--3e0b707e/"), ( "http://맨체스터시티축구단.한국/", "http://xn--2e0b17htvgtvj9haj53ccob62ni8d.xn--3e0b707e/", ), # Arabic ("http://nic.شبكة", "http://nic.xn--ngbc5azd"), # Chinese ("https://www.贷款.在线", "https://www.xn--0kwr83e.xn--3ds443g"), ("https://www2.xn--0kwr83e.在线", "https://www2.xn--0kwr83e.xn--3ds443g"), ("https://www3.贷款.xn--3ds443g", "https://www3.xn--0kwr83e.xn--3ds443g"), ) for idn_input, safe_result in websites: safeurl = safe_url_string(idn_input) self.assertEqual(safeurl, safe_result) # make sure the safe URL is unchanged when made safe a 2nd time for _, safe_result in websites: safeurl = safe_url_string(safe_result) self.assertEqual(safeurl, safe_result) def test_safe_url_idna_encoding_failure(self): # missing DNS label self.assertEqual( safe_url_string("http://.example.com/résumé?q=résumé"), "http://.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) # DNS label too long self.assertEqual( safe_url_string(f"http://www.{'example' * 11}.com/résumé?q=résumé"), f"http://www.{'example' * 11}.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) def test_safe_url_port_number(self): self.assertEqual( safe_url_string("http://www.example.com:80/résumé?q=résumé"), "http://www.example.com:80/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) self.assertEqual( safe_url_string("http://www.example.com:/résumé?q=résumé"), "http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) def test_safe_url_string_preserve_nonfragment_hash(self): # don't decode `%23` to `#` self.assertEqual( safe_url_string("http://www.example.com/path/to/%23/foo/bar"), "http://www.example.com/path/to/%23/foo/bar", ) self.assertEqual( safe_url_string("http://www.example.com/path/to/%23/foo/bar#frag"), "http://www.example.com/path/to/%23/foo/bar#frag", ) self.assertEqual( safe_url_string( "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2Fpath%2Fto%2F%23%2Fbar%2Ffoo" ), "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2Fpath%2Fto%2F%23%2Fbar%2Ffoo", ) self.assertEqual( safe_url_string( "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2F%2Fpath%2Fto%2F%23%2Fbar%2Ffoo#frag" ), "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2F%2Fpath%2Fto%2F%23%2Fbar%2Ffoo#frag", ) def test_safe_url_string_encode_idna_domain_with_port(self): self.assertEqual( safe_url_string("http://新华网.中国:80"), "http://xn--xkrr14bows.xn--fiqs8s:80", ) def test_safe_url_string_encode_idna_domain_with_username_password_and_port_number( self, ): self.assertEqual( safe_url_string("ftp://admin:admin@新华网.中国:21"), "ftp://admin:admin@xn--xkrr14bows.xn--fiqs8s:21", ) self.assertEqual( safe_url_string("http://Åsa:abc123@➡.ws:81/admin"), "http://%C3%85sa:abc123@xn--hgi.ws:81/admin", ) self.assertEqual( 
safe_url_string("http://japão:não@️i❤️.ws:8000/"), "http://jap%C3%A3o:n%C3%A3o@xn--i-7iq.ws:8000/", ) def test_safe_url_string_encode_idna_domain_with_username_and_empty_password_and_port_number( self, ): self.assertEqual( safe_url_string("ftp://admin:@新华网.中国:21"), "ftp://admin:@xn--xkrr14bows.xn--fiqs8s:21", ) self.assertEqual( safe_url_string("ftp://admin@新华网.中国:21"), "ftp://admin@xn--xkrr14bows.xn--fiqs8s:21", ) def test_safe_url_string_userinfo_unsafe_chars( self, ): self.assertEqual( safe_url_string("ftp://admin:|%@example.com"), "ftp://admin:%7C%25@example.com", ) def test_safe_url_string_user_and_pass_percentage_encoded(self): self.assertEqual( safe_url_string("http://%25user:%25pass@host"), "http://%25user:%25pass@host", ) self.assertEqual( safe_url_string("http://%user:%pass@host"), "http://%25user:%25pass@host", ) self.assertEqual( safe_url_string("http://%26user:%26pass@host"), "http://&user:&pass@host", ) self.assertEqual( safe_url_string("http://%2525user:%2525pass@host"), "http://%2525user:%2525pass@host", ) self.assertEqual( safe_url_string("http://%2526user:%2526pass@host"), "http://%2526user:%2526pass@host", ) self.assertEqual( safe_url_string("http://%25%26user:%25%26pass@host"), "http://%25&user:%25&pass@host", ) def test_safe_download_url(self): self.assertEqual( safe_download_url("http://www.example.org"), "http://www.example.org/" ) self.assertEqual( safe_download_url("http://www.example.org/../"), "http://www.example.org/" ) self.assertEqual( safe_download_url("http://www.example.org/../../images/../image"), "http://www.example.org/image", ) self.assertEqual( safe_download_url("http://www.example.org/dir/"), "http://www.example.org/dir/", ) self.assertEqual( safe_download_url(b"http://www.example.org/dir/"), "http://www.example.org/dir/", ) # Encoding related tests self.assertEqual( safe_download_url( b"http://www.example.org?\xa3", encoding="latin-1", path_encoding="latin-1", ), "http://www.example.org/?%A3", ) self.assertEqual( safe_download_url( b"http://www.example.org?\xc2\xa3", encoding="utf-8", path_encoding="utf-8", ), "http://www.example.org/?%C2%A3", ) self.assertEqual( safe_download_url( b"http://www.example.org/\xc2\xa3?\xc2\xa3", encoding="utf-8", path_encoding="latin-1", ), "http://www.example.org/%A3?%C2%A3", ) def test_is_url(self): self.assertTrue(is_url("http://www.example.org")) self.assertTrue(is_url("https://www.example.org")) self.assertTrue(is_url("file:///some/path")) self.assertFalse(is_url("foo://bar")) self.assertFalse(is_url("foo--bar")) def test_url_query_parameter(self): self.assertEqual( url_query_parameter("product.html?id=200&foo=bar", "id"), "200" ) self.assertEqual( url_query_parameter("product.html?id=200&foo=bar", "notthere", "mydefault"), "mydefault", ) self.assertEqual(url_query_parameter("product.html?id=", "id"), None) self.assertEqual( url_query_parameter("product.html?id=", "id", keep_blank_values=1), "" ) @pytest.mark.xfail def test_url_query_parameter_2(self): """ This problem was seen several times in the feeds. Sometime affiliate URLs contains nested encoded affiliate URL with direct URL as parameters. 
For example: aff_url1 = 'http://www.tkqlhce.com/click-2590032-10294381?url=http%3A%2F%2Fwww.argos.co.uk%2Fwebapp%2Fwcs%2Fstores%2Fservlet%2FArgosCreateReferral%3FstoreId%3D10001%26langId%3D-1%26referrer%3DCOJUN%26params%3Dadref%253DGarden+and+DIY-%3EGarden+furniture-%3EChildren%26%2339%3Bs+garden+furniture%26referredURL%3Dhttp%3A%2F%2Fwww.argos.co.uk%2Fwebapp%2Fwcs%2Fstores%2Fservlet%2FProductDisplay%253FstoreId%253D10001%2526catalogId%253D1500001501%2526productId%253D1500357023%2526langId%253D-1' the typical code to extract needed URL from it is: aff_url2 = url_query_parameter(aff_url1, 'url') after this aff2_url is: 'http://www.argos.co.uk/webapp/wcs/stores/servlet/ArgosCreateReferral?storeId=10001&langId=-1&referrer=COJUN¶ms=adref%3DGarden and DIY->Garden furniture->Children's gardenfurniture&referredURL=http://www.argos.co.uk/webapp/wcs/stores/servlet/ProductDisplay%3FstoreId%3D10001%26catalogId%3D1500001501%26productId%3D1500357023%26langId%3D-1' the direct URL extraction is url = url_query_parameter(aff_url2, 'referredURL') but this will not work, because aff_url2 contains ' (comma sign encoded in the feed) and the URL extraction will fail, current workaround was made in the spider, just a replace for ' to %27 """ # correct case aff_url1 = "http://www.anrdoezrs.net/click-2590032-10294381?url=http%3A%2F%2Fwww.argos.co.uk%2Fwebapp%2Fwcs%2Fstores%2Fservlet%2FArgosCreateReferral%3FstoreId%3D10001%26langId%3D-1%26referrer%3DCOJUN%26params%3Dadref%253DGarden+and+DIY-%3EGarden+furniture-%3EGarden+table+and+chair+sets%26referredURL%3Dhttp%3A%2F%2Fwww.argos.co.uk%2Fwebapp%2Fwcs%2Fstores%2Fservlet%2FProductDisplay%253FstoreId%253D10001%2526catalogId%253D1500001501%2526productId%253D1500357199%2526langId%253D-1" aff_url2 = url_query_parameter(aff_url1, "url") self.assertEqual( aff_url2, "http://www.argos.co.uk/webapp/wcs/stores/servlet/ArgosCreateReferral?storeId=10001&langId=-1&referrer=COJUN¶ms=adref%3DGarden and DIY->Garden furniture->Garden table and chair sets&referredURL=http://www.argos.co.uk/webapp/wcs/stores/servlet/ProductDisplay%3FstoreId%3D10001%26catalogId%3D1500001501%26productId%3D1500357199%26langId%3D-1", ) assert aff_url2 is not None prod_url = url_query_parameter(aff_url2, "referredURL") self.assertEqual( prod_url, "http://www.argos.co.uk/webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=1500001501&productId=1500357199&langId=-1", ) # weird case aff_url1 = "http://www.tkqlhce.com/click-2590032-10294381?url=http%3A%2F%2Fwww.argos.co.uk%2Fwebapp%2Fwcs%2Fstores%2Fservlet%2FArgosCreateReferral%3FstoreId%3D10001%26langId%3D-1%26referrer%3DCOJUN%26params%3Dadref%253DGarden+and+DIY-%3EGarden+furniture-%3EChildren%26%2339%3Bs+garden+furniture%26referredURL%3Dhttp%3A%2F%2Fwww.argos.co.uk%2Fwebapp%2Fwcs%2Fstores%2Fservlet%2FProductDisplay%253FstoreId%253D10001%2526catalogId%253D1500001501%2526productId%253D1500357023%2526langId%253D-1" aff_url2 = url_query_parameter(aff_url1, "url") self.assertEqual( aff_url2, "http://www.argos.co.uk/webapp/wcs/stores/servlet/ArgosCreateReferral?storeId=10001&langId=-1&referrer=COJUN¶ms=adref%3DGarden and DIY->Garden furniture->Children's garden furniture&referredURL=http://www.argos.co.uk/webapp/wcs/stores/servlet/ProductDisplay%3FstoreId%3D10001%26catalogId%3D1500001501%26productId%3D1500357023%26langId%3D-1", ) assert aff_url2 is not None prod_url = url_query_parameter(aff_url2, "referredURL") # fails, prod_url is None now self.assertEqual( prod_url, 
"http://www.argos.co.uk/webapp/wcs/stores/servlet/ProductDisplay?storeId=10001&catalogId=1500001501&productId=1500357023&langId=-1", ) def test_add_or_replace_parameter(self): url = "http://domain/test" self.assertEqual( add_or_replace_parameter(url, "arg", "v"), "http://domain/test?arg=v" ) url = "http://domain/test?arg1=v1&arg2=v2&arg3=v3" self.assertEqual( add_or_replace_parameter(url, "arg4", "v4"), "http://domain/test?arg1=v1&arg2=v2&arg3=v3&arg4=v4", ) self.assertEqual( add_or_replace_parameter(url, "arg3", "nv3"), "http://domain/test?arg1=v1&arg2=v2&arg3=nv3", ) self.assertEqual( add_or_replace_parameter( "http://domain/moreInfo.asp?prodID=", "prodID", "20" ), "http://domain/moreInfo.asp?prodID=20", ) url = "http://rmc-offers.co.uk/productlist.asp?BCat=2%2C60&CatID=60" self.assertEqual( add_or_replace_parameter(url, "BCat", "newvalue"), "http://rmc-offers.co.uk/productlist.asp?BCat=newvalue&CatID=60", ) url = "http://rmc-offers.co.uk/productlist.asp?BCat=2,60&CatID=60" self.assertEqual( add_or_replace_parameter(url, "BCat", "newvalue"), "http://rmc-offers.co.uk/productlist.asp?BCat=newvalue&CatID=60", ) url = "http://rmc-offers.co.uk/productlist.asp?" self.assertEqual( add_or_replace_parameter(url, "BCat", "newvalue"), "http://rmc-offers.co.uk/productlist.asp?BCat=newvalue", ) url = "http://example.com/?version=1&pageurl=http%3A%2F%2Fwww.example.com%2Ftest%2F%23fragment%3Dy¶m2=value2" self.assertEqual( add_or_replace_parameter(url, "version", "2"), "http://example.com/?version=2&pageurl=http%3A%2F%2Fwww.example.com%2Ftest%2F%23fragment%3Dy¶m2=value2", ) self.assertEqual( add_or_replace_parameter(url, "pageurl", "test"), "http://example.com/?version=1&pageurl=test¶m2=value2", ) url = "http://domain/test?arg1=v1&arg2=v2&arg1=v3" self.assertEqual( add_or_replace_parameter(url, "arg4", "v4"), "http://domain/test?arg1=v1&arg2=v2&arg1=v3&arg4=v4", ) self.assertEqual( add_or_replace_parameter(url, "arg1", "v3"), "http://domain/test?arg1=v3&arg2=v2", ) @pytest.mark.xfail(reason="https://github.com/scrapy/w3lib/issues/164") def test_add_or_replace_parameter_fail(self): self.assertEqual( add_or_replace_parameter( "http://domain/test?arg1=v1;arg2=v2", "arg1", "v3" ), "http://domain/test?arg1=v3&arg2=v2", ) def test_add_or_replace_parameters(self): url = "http://domain/test" self.assertEqual( add_or_replace_parameters(url, {"arg": "v"}), "http://domain/test?arg=v" ) url = "http://domain/test?arg1=v1&arg2=v2&arg3=v3" self.assertEqual( add_or_replace_parameters(url, {"arg4": "v4"}), "http://domain/test?arg1=v1&arg2=v2&arg3=v3&arg4=v4", ) self.assertEqual( add_or_replace_parameters(url, {"arg4": "v4", "arg3": "v3new"}), "http://domain/test?arg1=v1&arg2=v2&arg3=v3new&arg4=v4", ) url = "http://domain/test?arg1=v1&arg2=v2&arg1=v3" self.assertEqual( add_or_replace_parameters(url, {"arg4": "v4"}), "http://domain/test?arg1=v1&arg2=v2&arg1=v3&arg4=v4", ) self.assertEqual( add_or_replace_parameters(url, {"arg1": "v3"}), "http://domain/test?arg1=v3&arg2=v2", ) def test_add_or_replace_parameters_does_not_change_input_param(self): url = "http://domain/test?arg=original" input_param = {"arg": "value"} add_or_replace_parameters(url, input_param) # noqa self.assertEqual(input_param, {"arg": "value"}) def test_url_query_cleaner(self): self.assertEqual("product.html", url_query_cleaner("product.html?")) self.assertEqual("product.html", url_query_cleaner("product.html?&")) self.assertEqual( "product.html?id=200", url_query_cleaner("product.html?id=200&foo=bar&name=wired", ["id"]), ) self.assertEqual( 
"product.html?id=200", url_query_cleaner("product.html?&id=200&&foo=bar&name=wired", ["id"]), ) self.assertEqual( "product.html", url_query_cleaner("product.html?foo=bar&name=wired", ["id"]) ) self.assertEqual( "product.html?id=200&name=wired", url_query_cleaner("product.html?id=200&foo=bar&name=wired", ["id", "name"]), ) self.assertEqual( "product.html?id", url_query_cleaner("product.html?id&other=3&novalue=", ["id"]), ) # default is to remove duplicate keys self.assertEqual( "product.html?d=1", url_query_cleaner("product.html?d=1&e=b&d=2&d=3&other=other", ["d"]), ) # unique=False disables duplicate keys filtering self.assertEqual( "product.html?d=1&d=2&d=3", url_query_cleaner( "product.html?d=1&e=b&d=2&d=3&other=other", ["d"], unique=False ), ) self.assertEqual( "product.html?id=200&foo=bar", url_query_cleaner( "product.html?id=200&foo=bar&name=wired#id20", ["id", "foo"] ), ) self.assertEqual( "product.html?foo=bar&name=wired", url_query_cleaner( "product.html?id=200&foo=bar&name=wired", ["id"], remove=True ), ) self.assertEqual( "product.html?name=wired", url_query_cleaner( "product.html?id=2&foo=bar&name=wired", ["id", "foo"], remove=True ), ) self.assertEqual( "product.html?foo=bar&name=wired", url_query_cleaner( "product.html?id=2&foo=bar&name=wired", ["id", "footo"], remove=True ), ) self.assertEqual( "product.html", url_query_cleaner("product.html", ["id"], remove=True) ) self.assertEqual( "product.html", url_query_cleaner("product.html?&", ["id"], remove=True) ) self.assertEqual( "product.html?foo=bar", url_query_cleaner("product.html?foo=bar&name=wired", "foo"), ) self.assertEqual( "product.html?foobar=wired", url_query_cleaner("product.html?foo=bar&foobar=wired", "foobar"), ) def test_url_query_cleaner_keep_fragments(self): self.assertEqual( "product.html?id=200#foo", url_query_cleaner( "product.html?id=200&foo=bar&name=wired#foo", ["id"], keep_fragments=True, ), ) self.assertEqual( "product.html?id=200", url_query_cleaner( "product.html?id=200&foo=bar&name=wired", ["id"], keep_fragments=True ), ) def test_path_to_file_uri(self): if os.name == "nt": self.assertEqual( path_to_file_uri(r"C:\\windows\clock.avi"), "file:///C:/windows/clock.avi", ) else: self.assertEqual( path_to_file_uri("/some/path.txt"), "file:///some/path.txt" ) fn = "test.txt" x = path_to_file_uri(fn) self.assertTrue(x.startswith("file:///")) self.assertEqual(file_uri_to_path(x).lower(), os.path.abspath(fn).lower()) def test_file_uri_to_path(self): if os.name == "nt": self.assertEqual( file_uri_to_path("file:///C:/windows/clock.avi"), r"C:\\windows\clock.avi", ) uri = "file:///C:/windows/clock.avi" uri2 = path_to_file_uri(file_uri_to_path(uri)) self.assertEqual(uri, uri2) else: self.assertEqual( file_uri_to_path("file:///path/to/test.txt"), "/path/to/test.txt" ) self.assertEqual(file_uri_to_path("/path/to/test.txt"), "/path/to/test.txt") uri = "file:///path/to/test.txt" uri2 = path_to_file_uri(file_uri_to_path(uri)) self.assertEqual(uri, uri2) self.assertEqual(file_uri_to_path("test.txt"), "test.txt") def test_any_to_uri(self): if os.name == "nt": self.assertEqual( any_to_uri(r"C:\\windows\clock.avi"), "file:///C:/windows/clock.avi" ) else: self.assertEqual(any_to_uri("/some/path.txt"), "file:///some/path.txt") self.assertEqual(any_to_uri("file:///some/path.txt"), "file:///some/path.txt") self.assertEqual( any_to_uri("http://www.example.com/some/path.txt"), "http://www.example.com/some/path.txt", ) class CanonicalizeUrlTest(unittest.TestCase): def test_canonicalize_url(self): # simplest case self.assertEqual( 
canonicalize_url("http://www.example.com/"), "http://www.example.com/" ) def test_return_str(self): assert isinstance(canonicalize_url("http://www.example.com"), str) assert isinstance(canonicalize_url(b"http://www.example.com"), str) def test_append_missing_path(self): self.assertEqual( canonicalize_url("http://www.example.com"), "http://www.example.com/" ) def test_typical_usage(self): self.assertEqual( canonicalize_url("http://www.example.com/do?a=1&b=2&c=3"), "http://www.example.com/do?a=1&b=2&c=3", ) self.assertEqual( canonicalize_url("http://www.example.com/do?c=1&b=2&a=3"), "http://www.example.com/do?a=3&b=2&c=1", ) self.assertEqual( canonicalize_url("http://www.example.com/do?&a=1"), "http://www.example.com/do?a=1", ) def test_port_number(self): self.assertEqual( canonicalize_url("http://www.example.com:8888/do?a=1&b=2&c=3"), "http://www.example.com:8888/do?a=1&b=2&c=3", ) # trailing empty ports are removed self.assertEqual( canonicalize_url("http://www.example.com:/do?a=1&b=2&c=3"), "http://www.example.com/do?a=1&b=2&c=3", ) def test_sorting(self): self.assertEqual( canonicalize_url("http://www.example.com/do?c=3&b=5&b=2&a=50"), "http://www.example.com/do?a=50&b=2&b=5&c=3", ) def test_keep_blank_values(self): self.assertEqual( canonicalize_url( "http://www.example.com/do?b=&a=2", keep_blank_values=False ), "http://www.example.com/do?a=2", ) self.assertEqual( canonicalize_url("http://www.example.com/do?b=&a=2"), "http://www.example.com/do?a=2&b=", ) self.assertEqual( canonicalize_url( "http://www.example.com/do?b=&c&a=2", keep_blank_values=False ), "http://www.example.com/do?a=2", ) self.assertEqual( canonicalize_url("http://www.example.com/do?b=&c&a=2"), "http://www.example.com/do?a=2&b=&c=", ) self.assertEqual( canonicalize_url("http://www.example.com/do?1750,4"), "http://www.example.com/do?1750%2C4=", ) def test_spaces(self): self.assertEqual( canonicalize_url("http://www.example.com/do?q=a space&a=1"), "http://www.example.com/do?a=1&q=a+space", ) self.assertEqual( canonicalize_url("http://www.example.com/do?q=a+space&a=1"), "http://www.example.com/do?a=1&q=a+space", ) self.assertEqual( canonicalize_url("http://www.example.com/do?q=a%20space&a=1"), "http://www.example.com/do?a=1&q=a+space", ) def test_canonicalize_url_unicode_path(self): self.assertEqual( canonicalize_url("http://www.example.com/résumé"), "http://www.example.com/r%C3%A9sum%C3%A9", ) def test_canonicalize_url_unicode_query_string(self): # default encoding for path and query is UTF-8 self.assertEqual( canonicalize_url("http://www.example.com/résumé?q=résumé"), "http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) # passed encoding will affect query string self.assertEqual( canonicalize_url( "http://www.example.com/résumé?q=résumé", encoding="latin1" ), "http://www.example.com/r%C3%A9sum%C3%A9?q=r%E9sum%E9", ) self.assertEqual( canonicalize_url( "http://www.example.com/résumé?country=Россия", encoding="cp1251" ), "http://www.example.com/r%C3%A9sum%C3%A9?country=%D0%EE%F1%F1%E8%FF", ) def test_canonicalize_url_unicode_query_string_wrong_encoding(self): # trying to encode with wrong encoding # fallback to UTF-8 self.assertEqual( canonicalize_url( "http://www.example.com/résumé?currency=€", encoding="latin1" ), "http://www.example.com/r%C3%A9sum%C3%A9?currency=%E2%82%AC", ) self.assertEqual( canonicalize_url( "http://www.example.com/résumé?country=Россия", encoding="latin1" ), "http://www.example.com/r%C3%A9sum%C3%A9?country=%D0%A0%D0%BE%D1%81%D1%81%D0%B8%D1%8F", ) def 
test_normalize_percent_encoding_in_paths(self): self.assertEqual( canonicalize_url("http://www.example.com/r%c3%a9sum%c3%a9"), "http://www.example.com/r%C3%A9sum%C3%A9", ) # non-UTF8 encoded sequences: they should be kept untouched, only upper-cased # 'latin1'-encoded sequence in path self.assertEqual( canonicalize_url("http://www.example.com/a%a3do"), "http://www.example.com/a%A3do", ) # 'latin1'-encoded path, UTF-8 encoded query string self.assertEqual( canonicalize_url("http://www.example.com/a%a3do?q=r%c3%a9sum%c3%a9"), "http://www.example.com/a%A3do?q=r%C3%A9sum%C3%A9", ) # 'latin1'-encoded path and query string self.assertEqual( canonicalize_url("http://www.example.com/a%a3do?q=r%e9sum%e9"), "http://www.example.com/a%A3do?q=r%E9sum%E9", ) url = "https://example.com/a%23b%2cc#bash" canonical = canonicalize_url(url) # %23 is not accidentally interpreted as a URL fragment separator self.assertEqual(canonical, "https://example.com/a%23b,c") self.assertEqual(canonical, canonicalize_url(canonical)) def test_normalize_percent_encoding_in_query_arguments(self): self.assertEqual( canonicalize_url("http://www.example.com/do?k=b%a3"), "http://www.example.com/do?k=b%A3", ) self.assertEqual( canonicalize_url("http://www.example.com/do?k=r%c3%a9sum%c3%a9"), "http://www.example.com/do?k=r%C3%A9sum%C3%A9", ) def test_non_ascii_percent_encoding_in_paths(self): self.assertEqual( canonicalize_url("http://www.example.com/a do?a=1"), "http://www.example.com/a%20do?a=1", ) self.assertEqual( canonicalize_url("http://www.example.com/a %20do?a=1"), "http://www.example.com/a%20%20do?a=1", ) self.assertEqual( canonicalize_url("http://www.example.com/a do£.html?a=1"), "http://www.example.com/a%20do%C2%A3.html?a=1", ) self.assertEqual( canonicalize_url(b"http://www.example.com/a do\xc2\xa3.html?a=1"), "http://www.example.com/a%20do%C2%A3.html?a=1", ) def test_non_ascii_percent_encoding_in_query_arguments(self): self.assertEqual( canonicalize_url("http://www.example.com/do?price=£500&a=5&z=3"), "http://www.example.com/do?a=5&price=%C2%A3500&z=3", ) self.assertEqual( canonicalize_url(b"http://www.example.com/do?price=\xc2\xa3500&a=5&z=3"), "http://www.example.com/do?a=5&price=%C2%A3500&z=3", ) self.assertEqual( canonicalize_url(b"http://www.example.com/do?price(\xc2\xa3)=500&a=1"), "http://www.example.com/do?a=1&price%28%C2%A3%29=500", ) def test_urls_with_auth_and_ports(self): self.assertEqual( canonicalize_url("http://user:pass@www.example.com:81/do?now=1"), "http://user:pass@www.example.com:81/do?now=1", ) def test_remove_fragments(self): self.assertEqual( canonicalize_url("http://user:pass@www.example.com/do?a=1#frag"), "http://user:pass@www.example.com/do?a=1", ) self.assertEqual( canonicalize_url( "http://user:pass@www.example.com/do?a=1#frag", keep_fragments=True ), "http://user:pass@www.example.com/do?a=1#frag", ) def test_dont_convert_safe_characters(self): # dont convert safe characters to percent encoding representation self.assertEqual( canonicalize_url( "http://www.simplybedrooms.com/White-Bedroom-Furniture/Bedroom-Mirror:-Josephine-Cheval-Mirror.html" ), "http://www.simplybedrooms.com/White-Bedroom-Furniture/Bedroom-Mirror:-Josephine-Cheval-Mirror.html", ) def test_safe_characters_unicode(self): # urllib.quote uses a mapping cache of encoded characters. when parsing # an already percent-encoded url, it will fail if that url was not # percent-encoded as utf-8, that's why canonicalize_url must always # convert the urls to string. the following test asserts that # functionality. 
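        # Illustrative aside (not part of the original test): "%E9" is a
        # Latin-1 escape for "é" rather than a UTF-8 one, so canonicalize_url()
        # must keep the sequence as-is rather than re-decode it (it only
        # upper-cases the hex digits, compare
        # test_normalize_percent_encoding_in_paths above).
        #   >>> "é".encode("latin-1"), "é".encode("utf-8")
        #   (b'\xe9', b'\xc3\xa9')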
self.assertEqual( canonicalize_url("http://www.example.com/caf%E9-con-leche.htm"), "http://www.example.com/caf%E9-con-leche.htm", ) def test_domains_are_case_insensitive(self): self.assertEqual( canonicalize_url("http://www.EXAMPLE.com/"), "http://www.example.com/" ) def test_userinfo_is_case_sensitive(self): self.assertEqual( canonicalize_url("sftp://UsEr:PaSsWoRd@www.EXAMPLE.com/"), "sftp://UsEr:PaSsWoRd@www.example.com/", ) def test_canonicalize_idns(self): self.assertEqual( canonicalize_url("http://www.bücher.de?q=bücher"), "http://www.xn--bcher-kva.de/?q=b%C3%BCcher", ) # Japanese (+ reordering query parameters) self.assertEqual( canonicalize_url("http://はじめよう.みんな/?query=サ&maxResults=5"), "http://xn--p8j9a0d9c9a.xn--q9jyb4c/?maxResults=5&query=%E3%82%B5", ) def test_quoted_slash_and_question_sign(self): self.assertEqual( canonicalize_url("http://foo.com/AC%2FDC+rocks%3f/?yeah=1"), "http://foo.com/AC%2FDC+rocks%3F/?yeah=1", ) self.assertEqual( canonicalize_url("http://foo.com/AC%2FDC/"), "http://foo.com/AC%2FDC/" ) def test_canonicalize_urlparsed(self): # canonicalize_url() can be passed an already urlparse'd URL self.assertEqual( canonicalize_url(urlparse("http://www.example.com/résumé?q=résumé")), "http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) self.assertEqual( canonicalize_url(urlparse("http://www.example.com/caf%e9-con-leche.htm")), "http://www.example.com/caf%E9-con-leche.htm", ) self.assertEqual( canonicalize_url( urlparse("http://www.example.com/a%a3do?q=r%c3%a9sum%c3%a9") ), "http://www.example.com/a%A3do?q=r%C3%A9sum%C3%A9", ) def test_canonicalize_parse_url(self): # parse_url() wraps urlparse and is used in link extractors self.assertEqual( canonicalize_url(parse_url("http://www.example.com/résumé?q=résumé")), "http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) self.assertEqual( canonicalize_url(parse_url("http://www.example.com/caf%e9-con-leche.htm")), "http://www.example.com/caf%E9-con-leche.htm", ) self.assertEqual( canonicalize_url( parse_url("http://www.example.com/a%a3do?q=r%c3%a9sum%c3%a9") ), "http://www.example.com/a%A3do?q=r%C3%A9sum%C3%A9", ) def test_canonicalize_url_idempotence(self): for url, enc in [ ("http://www.bücher.de/résumé?q=résumé", "utf8"), ("http://www.example.com/résumé?q=résumé", "latin1"), ("http://www.example.com/résumé?country=Россия", "cp1251"), ("http://はじめよう.みんな/?query=サ&maxResults=5", "iso2022jp"), ]: canonicalized = canonicalize_url(url, encoding=enc) # if we canonicalize again, we ge the same result self.assertEqual( canonicalize_url(canonicalized, encoding=enc), canonicalized ) # without encoding, already canonicalized URL is canonicalized identically self.assertEqual(canonicalize_url(canonicalized), canonicalized) def test_canonicalize_url_idna_exceptions(self): # missing DNS label self.assertEqual( canonicalize_url("http://.example.com/résumé?q=résumé"), "http://.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) # DNS label too long self.assertEqual( canonicalize_url(f"http://www.{'example' * 11}.com/résumé?q=résumé"), f"http://www.{'example' * 11}.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9", ) def test_preserve_nonfragment_hash(self): # don't decode `%23` to `#` self.assertEqual( canonicalize_url("http://www.example.com/path/to/%23/foo/bar"), "http://www.example.com/path/to/%23/foo/bar", ) self.assertEqual( canonicalize_url("http://www.example.com/path/to/%23/foo/bar#frag"), "http://www.example.com/path/to/%23/foo/bar", ) self.assertEqual( canonicalize_url( "http://www.example.com/path/to/%23/foo/bar#frag", 
keep_fragments=True ), "http://www.example.com/path/to/%23/foo/bar#frag", ) self.assertEqual( canonicalize_url( "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2Fpath%2Fto%2F%23%2Fbar%2Ffoo" ), "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2Fpath%2Fto%2F%23%2Fbar%2Ffoo", ) self.assertEqual( canonicalize_url( "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2F%2Fpath%2Fto%2F%23%2Fbar%2Ffoo#frag" ), "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2F%2Fpath%2Fto%2F%23%2Fbar%2Ffoo", ) self.assertEqual( canonicalize_url( "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2F%2Fpath%2Fto%2F%23%2Fbar%2Ffoo#frag", keep_fragments=True, ), "http://www.example.com/path/to/%23/foo/bar?url=http%3A%2F%2Fwww.example.com%2F%2Fpath%2Fto%2F%23%2Fbar%2Ffoo#frag", ) def test_strip_spaces(self): self.assertEqual( canonicalize_url(" https://example.com"), "https://example.com/" ) self.assertEqual( canonicalize_url("https://example.com "), "https://example.com/" ) self.assertEqual( canonicalize_url(" https://example.com "), "https://example.com/" ) class DataURITests(unittest.TestCase): def test_default_mediatype_charset(self): result = parse_data_uri("data:,A%20brief%20note") self.assertEqual(result.media_type, "text/plain") self.assertEqual(result.media_type_parameters, {"charset": "US-ASCII"}) self.assertEqual(result.data, b"A brief note") def test_text_uri(self): result = parse_data_uri("data:,A%20brief%20note") self.assertEqual(result.data, b"A brief note") def test_bytes_uri(self): result = parse_data_uri(b"data:,A%20brief%20note") self.assertEqual(result.data, b"A brief note") def test_unicode_uri(self): result = parse_data_uri("data:,é") self.assertEqual(result.data, "é".encode()) def test_default_mediatype(self): result = parse_data_uri("data:;charset=iso-8859-7,%be%d3%be") self.assertEqual(result.media_type, "text/plain") self.assertEqual(result.media_type_parameters, {"charset": "iso-8859-7"}) self.assertEqual(result.data, b"\xbe\xd3\xbe") def test_text_charset(self): result = parse_data_uri("data:text/plain;charset=iso-8859-7,%be%d3%be") self.assertEqual(result.media_type, "text/plain") self.assertEqual(result.media_type_parameters, {"charset": "iso-8859-7"}) self.assertEqual(result.data, b"\xbe\xd3\xbe") def test_mediatype_parameters(self): result = parse_data_uri( "data:text/plain;" "foo=%22foo;bar%5C%22%22;" "charset=utf-8;" "bar=%22foo;%5C%22foo%20;/%20,%22," "%CE%8E%CE%A3%CE%8E" ) self.assertEqual(result.media_type, "text/plain") self.assertEqual( result.media_type_parameters, {"charset": "utf-8", "foo": 'foo;bar"', "bar": 'foo;"foo ;/ ,'}, ) self.assertEqual(result.data, b"\xce\x8e\xce\xa3\xce\x8e") def test_base64(self): result = parse_data_uri("data:text/plain;base64,SGVsbG8sIHdvcmxkLg%3D%3D") self.assertEqual(result.media_type, "text/plain") self.assertEqual(result.data, b"Hello, world.") def test_base64_spaces(self): result = parse_data_uri( "data:text/plain;base64,SGVsb%20G8sIH%0A%20%20" "dvcm%20%20%20xk%20Lg%3D%0A%3D" ) self.assertEqual(result.media_type, "text/plain") self.assertEqual(result.data, b"Hello, world.") result = parse_data_uri( "data:text/plain;base64,SGVsb G8sIH\n dvcm xk Lg%3D\n%3D" ) self.assertEqual(result.media_type, "text/plain") self.assertEqual(result.data, b"Hello, world.") def test_wrong_base64_param(self): with self.assertRaises(ValueError): parse_data_uri("data:text/plain;baes64,SGVsbG8sIHdvcmxkLg%3D%3D") def 
test_missing_comma(self): with self.assertRaises(ValueError): parse_data_uri("data:A%20brief%20note") def test_missing_scheme(self): with self.assertRaises(ValueError): parse_data_uri("text/plain,A%20brief%20note") def test_wrong_scheme(self): with self.assertRaises(ValueError): parse_data_uri("http://example.com/") def test_scheme_case_insensitive(self): result = parse_data_uri("DATA:,A%20brief%20note") self.assertEqual(result.data, b"A brief note") result = parse_data_uri("DaTa:,A%20brief%20note") self.assertEqual(result.data, b"A brief note") if __name__ == "__main__": unittest.main() ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/tests/test_util.py0000644000175100001660000000060414745713200015673 0ustar00runnerdockerfrom unittest import TestCase from pytest import raises from w3lib.util import to_bytes, to_unicode class ToBytesTestCase(TestCase): def test_type_error(self): with raises(TypeError): to_bytes(True) # type: ignore class ToUnicodeTestCase(TestCase): def test_type_error(self): with raises(TypeError): to_unicode(True) # type: ignore ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/tox.ini0000644000175100001660000000235314745713200013461 0ustar00runnerdocker# Tox (http://tox.testrun.org/) is a tool for running tests # in multiple virtualenvs. This configuration file will run the # test suite on all supported python versions. To use it, "pip install tox" # and then run "tox" from this directory. [tox] envlist = py39, py310, py311, py312, py313, pypy3.10, docs, pylint, typing, pre-commit, twinecheck [testenv] deps = pytest !=3.1.1, !=3.1.2 pytest-cov commands = python -m pytest \ --doctest-modules \ --cov=w3lib --cov-report=term --cov-report=xml \ {posargs:w3lib tests} [testenv:typing] basepython = python3 deps = # mypy would error if pytest (or its stub) not found pytest mypy==1.14.1 commands = mypy --strict {posargs: w3lib tests} [testenv:pylint] deps = {[testenv]deps} pylint==3.3.3 commands = pylint conftest.py docs setup.py tests w3lib [testenv:docs] changedir = docs deps = -rdocs/requirements.txt commands = sphinx-build -W -b html . 
{envtmpdir}/html [testenv:pre-commit] deps = pre-commit commands = pre-commit run --all-files --show-diff-on-failure skip_install = true [testenv:twinecheck] basepython = python3 deps = twine==6.1.0 build==1.2.2.post1 commands = python -m build --sdist twine check dist/* ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1737987726.7840877 w3lib-2.3.1/w3lib/0000755000175100001660000000000014745713217013173 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/__init__.py0000644000175100001660000000015114745713200015271 0ustar00runnerdocker__version__ = "2.3.1" version_info = tuple(int(v) if v.isdigit() else v for v in __version__.split(".")) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/_infra.py0000644000175100001660000000072514745713200014777 0ustar00runnerdocker# https://infra.spec.whatwg.org/ import string # https://infra.spec.whatwg.org/commit-snapshots/59e0d16c1e3ba0e77c6a60bfc69a0929b8ffaa5d/#code-points _ASCII_TAB_OR_NEWLINE = "\t\n\r" _ASCII_WHITESPACE = "\t\n\x0c\r " _C0_CONTROL = "".join(chr(n) for n in range(32)) _C0_CONTROL_OR_SPACE = _C0_CONTROL + " " _ASCII_DIGIT = string.digits _ASCII_HEX_DIGIT = string.hexdigits _ASCII_ALPHA = string.ascii_letters _ASCII_ALPHANUMERIC = string.ascii_letters + string.digits ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/_types.py0000644000175100001660000000034414745713200015041 0ustar00runnerdockerfrom __future__ import annotations from typing import Union # the base class UnicodeError doesn't have attributes like start / end AnyUnicodeError = Union[UnicodeEncodeError, UnicodeDecodeError] StrOrBytes = Union[str, bytes] ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/_url.py0000644000175100001660000000045214745713200014477 0ustar00runnerdocker# https://url.spec.whatwg.org/ # https://url.spec.whatwg.org/commit-snapshots/a46cb9188a48c2c9d80ba32a9b1891652d6b4900/#default-port _DEFAULT_PORTS = { "ftp": 21, "file": None, "http": 80, "https": 443, "ws": 80, "wss": 443, } _SPECIAL_SCHEMES = set(_DEFAULT_PORTS.keys()) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/encoding.py0000644000175100001660000002416614745713200015334 0ustar00runnerdocker""" Functions for handling encoding of web pages """ from __future__ import annotations import codecs import encodings import re from re import Match from typing import Callable, cast import w3lib.util from w3lib._types import AnyUnicodeError, StrOrBytes _HEADER_ENCODING_RE = re.compile(r"charset=([\w-]+)", re.I) def http_content_type_encoding(content_type: str | None) -> str | None: """Extract the encoding in the content-type header >>> import w3lib.encoding >>> w3lib.encoding.http_content_type_encoding("Content-Type: text/html; charset=ISO-8859-4") 'iso8859-4' """ if content_type: match = _HEADER_ENCODING_RE.search(content_type) if match: return resolve_encoding(match.group(1)) return None # regexp for parsing HTTP meta tags _TEMPLATE = r"""%s\s*=\s*["']?\s*%s\s*["']?""" _SKIP_ATTRS = """(?:\\s+ [^=<>/\\s"'\x00-\x1f\x7f]+ # Attribute name (?:\\s*=\\s* (?: # ' and " are entity encoded (', "), so no need for \', \" '[^']*' # attr in ' | "[^"]*" # attr in " | [^'"\\s]+ # attr having no ' nor " ))? 
)*?""" # must be used with re.VERBOSE flag _HTTPEQUIV_RE = _TEMPLATE % ("http-equiv", "Content-Type") _CONTENT_RE = _TEMPLATE % ("content", r"(?P[^;]+);\s*charset=(?P[\w-]+)") _CONTENT2_RE = _TEMPLATE % ("charset", r"(?P[\w-]+)") _XML_ENCODING_RE = _TEMPLATE % ("encoding", r"(?P[\w-]+)") # check for meta tags, or xml decl. and stop search if a body tag is encountered _BODY_ENCODING_PATTERN = ( r"<\s*(?:meta%s(?:(?:\s+%s|\s+%s){2}|\s+%s)|\?xml\s[^>]+%s|body)" % (_SKIP_ATTRS, _HTTPEQUIV_RE, _CONTENT_RE, _CONTENT2_RE, _XML_ENCODING_RE) ) _BODY_ENCODING_STR_RE = re.compile(_BODY_ENCODING_PATTERN, re.I | re.VERBOSE) _BODY_ENCODING_BYTES_RE = re.compile( _BODY_ENCODING_PATTERN.encode("ascii"), re.I | re.VERBOSE ) def html_body_declared_encoding(html_body_str: StrOrBytes) -> str | None: '''Return the encoding specified in meta tags in the html body, or ``None`` if no suitable encoding was found >>> import w3lib.encoding >>> w3lib.encoding.html_body_declared_encoding( ... """ ... ... ... Some title ... ... ... ... ... ... ... """) 'utf-8' >>> ''' # html5 suggests the first 1024 bytes are sufficient, we allow for more chunk = html_body_str[:4096] match: Match[bytes] | Match[str] | None if isinstance(chunk, bytes): match = _BODY_ENCODING_BYTES_RE.search(chunk) else: match = _BODY_ENCODING_STR_RE.search(chunk) if match: encoding = ( match.group("charset") or match.group("charset2") or match.group("xmlcharset") ) if encoding: return resolve_encoding(w3lib.util.to_unicode(encoding)) return None # Default encoding translation # this maps cannonicalized encodings to target encodings # see http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#character-encodings-0 # in addition, gb18030 supercedes gb2312 & gbk # the keys are converted using _c18n_encoding and in sorted order DEFAULT_ENCODING_TRANSLATION = { "ascii": "cp1252", "big5": "big5hkscs", "euc_kr": "cp949", "gb2312": "gb18030", "gb_2312_80": "gb18030", "gbk": "gb18030", "iso8859_11": "cp874", "iso8859_9": "cp1254", "latin_1": "cp1252", "macintosh": "mac_roman", "shift_jis": "cp932", "tis_620": "cp874", "win_1251": "cp1251", "windows_31j": "cp932", "win_31j": "cp932", "windows_874": "cp874", "win_874": "cp874", "x_sjis": "cp932", "zh_cn": "gb18030", } def _c18n_encoding(encoding: str) -> str: """Canonicalize an encoding name This performs normalization and translates aliases using python's encoding aliases """ normed = encodings.normalize_encoding(encoding).lower() return cast(str, encodings.aliases.aliases.get(normed, normed)) def resolve_encoding(encoding_alias: str) -> str | None: """Return the encoding that `encoding_alias` maps to, or ``None`` if the encoding cannot be interpreted >>> import w3lib.encoding >>> w3lib.encoding.resolve_encoding('latin1') 'cp1252' >>> w3lib.encoding.resolve_encoding('gb_2312-80') 'gb18030' >>> """ c18n_encoding = _c18n_encoding(encoding_alias) translated = DEFAULT_ENCODING_TRANSLATION.get(c18n_encoding, c18n_encoding) try: return codecs.lookup(translated).name except LookupError: return None _BOM_TABLE = [ (codecs.BOM_UTF32_BE, "utf-32-be"), (codecs.BOM_UTF32_LE, "utf-32-le"), (codecs.BOM_UTF16_BE, "utf-16-be"), (codecs.BOM_UTF16_LE, "utf-16-le"), (codecs.BOM_UTF8, "utf-8"), ] _FIRST_CHARS = {c[0] for (c, _) in _BOM_TABLE} def read_bom(data: bytes) -> tuple[None, None] | tuple[str, bytes]: r"""Read the byte order mark in the text, if present, and return the encoding represented by the BOM and the BOM. If no BOM can be detected, ``(None, None)`` is returned. 
>>> import w3lib.encoding >>> w3lib.encoding.read_bom(b'\xfe\xff\x6c\x34') ('utf-16-be', '\xfe\xff') >>> w3lib.encoding.read_bom(b'\xff\xfe\x34\x6c') ('utf-16-le', '\xff\xfe') >>> w3lib.encoding.read_bom(b'\x00\x00\xfe\xff\x00\x00\x6c\x34') ('utf-32-be', '\x00\x00\xfe\xff') >>> w3lib.encoding.read_bom(b'\xff\xfe\x00\x00\x34\x6c\x00\x00') ('utf-32-le', '\xff\xfe\x00\x00') >>> w3lib.encoding.read_bom(b'\x01\x02\x03\x04') (None, None) >>> """ # common case is no BOM, so this is fast if data and data[0] in _FIRST_CHARS: for bom, encoding in _BOM_TABLE: if data.startswith(bom): return encoding, bom return None, None # Python decoder doesn't follow unicode standard when handling # bad utf-8 encoded strings. see http://bugs.python.org/issue8271 codecs.register_error( "w3lib_replace", lambda exc: ("\ufffd", cast(AnyUnicodeError, exc).end) ) def to_unicode(data_str: bytes, encoding: str) -> str: """Convert a str object to unicode using the encoding given Characters that cannot be converted will be converted to ``\\ufffd`` (the unicode replacement character). """ return data_str.decode(encoding, "replace") def html_to_unicode( content_type_header: str | None, html_body_str: bytes, default_encoding: str = "utf8", auto_detect_fun: Callable[[bytes], str | None] | None = None, ) -> tuple[str, str]: r'''Convert raw html bytes to unicode This attempts to make a reasonable guess at the content encoding of the html body, following a similar process to a web browser. It will try in order: * BOM (byte-order mark) * http content type header * meta or xml tag declarations * auto-detection, if the `auto_detect_fun` keyword argument is not ``None`` * default encoding in keyword arg (which defaults to utf8) If an encoding other than the auto-detected or default encoding is used, overrides will be applied, converting some character encodings to more suitable alternatives. If a BOM is found matching the encoding, it will be stripped. The `auto_detect_fun` argument can be used to pass a function that will sniff the encoding of the text. This function must take the raw text as an argument and return the name of an encoding that python can process, or None. To use chardet, for example, you can define the function as:: auto_detect_fun=lambda x: chardet.detect(x).get('encoding') or to use UnicodeDammit (shipped with the BeautifulSoup library):: auto_detect_fun=lambda x: UnicodeDammit(x).originalEncoding If the locale of the website or user language preference is known, then a better default encoding can be supplied. If `content_type_header` is not present, ``None`` can be passed signifying that the header was not present. This method will not fail, if characters cannot be converted to unicode, ``\\ufffd`` (the unicode replacement character) will be inserted instead. Returns a tuple of ``(, )`` Examples: >>> import w3lib.encoding >>> w3lib.encoding.html_to_unicode(None, ... b""" ... ... ... ... Creative Commons France ... ... ...

Creative Commons est une organisation \xc3\xa0 but non lucratif ... qui a pour dessein de faciliter la diffusion et le partage des oeuvres ... tout en accompagnant les nouvelles pratiques de cr\xc3\xa9ation \xc3\xa0 l\xe2\x80\x99\xc3\xa8re numerique.

... ... """) ('utf-8', '\n\n\n\nCreative Commons France\n\n\n

Creative Commons est une organisation \xe0 but non lucratif\nqui a pour dessein de faciliter la diffusion et le partage des oeuvres\ntout en accompagnant les nouvelles pratiques de cr\xe9ation \xe0 l\u2019\xe8re numerique.

\n\n') >>> ''' bom_enc, bom = read_bom(html_body_str) if bom_enc is not None: bom = cast(bytes, bom) return bom_enc, to_unicode(html_body_str[len(bom) :], bom_enc) enc = http_content_type_encoding(content_type_header) if enc is not None: if enc in {"utf-16", "utf-32"}: enc += "-be" return enc, to_unicode(html_body_str, enc) enc = html_body_declared_encoding(html_body_str) if enc is None and (auto_detect_fun is not None): enc = auto_detect_fun(html_body_str) if enc is None: enc = default_encoding return enc, to_unicode(html_body_str, enc) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/html.py0000644000175100001660000002712214745713200014505 0ustar00runnerdocker""" Functions for dealing with markup text """ from __future__ import annotations import re from collections.abc import Iterable from html.entities import name2codepoint from re import Match, Pattern from urllib.parse import urljoin from w3lib._types import StrOrBytes from w3lib.url import safe_url_string from w3lib.util import to_unicode _ent_re = re.compile( r"&((?P[a-z\d]+)|#(?P\d+)|#x(?P[a-f\d]+))(?P;?)", re.IGNORECASE, ) _tag_re = re.compile(r"<[a-zA-Z\/!].*?>", re.DOTALL) _baseurl_re = re.compile(r"]*href\s*=\s*[\"\']\s*([^\"\'\s]+)\s*[\"\']", re.I) _meta_refresh_re = re.compile( r']*http-equiv[^>]*refresh[^>]*content\s*=\s*(?P["\'])(?P(\d*\.)?\d+)\s*;\s*url=\s*(?P.*?)(?P=quote)', re.DOTALL | re.IGNORECASE, ) _meta_refresh_re2 = re.compile( r']*content\s*=\s*(?P["\'])(?P(\d*\.)?\d+)\s*;\s*url=\s*(?P.*?)(?P=quote)[^>]*?\shttp-equiv\s*=[^>]*refresh', re.DOTALL | re.IGNORECASE, ) _cdata_re = re.compile( r"((?P.*?)(?P\]\]>))", re.DOTALL ) HTML5_WHITESPACE = " \t\n\r\x0c" def replace_entities( text: StrOrBytes, keep: Iterable[str] = (), remove_illegal: bool = True, encoding: str = "utf-8", ) -> str: """Remove entities from the given `text` by converting them to their corresponding unicode character. `text` can be a unicode string or a byte string encoded in the given `encoding` (which defaults to 'utf-8'). If `keep` is passed (with a list of entity names) those entities will be kept (they won't be removed). It supports both numeric entities (``&#nnnn;`` and ``&#hhhh;``) and named entities (such as `` `` or ``>``). If `remove_illegal` is ``True``, entities that can't be converted are removed. If `remove_illegal` is ``False``, entities that can't be converted are kept "as is". For more information see the tests. Always returns a unicode string (with the entities removed). >>> import w3lib.html >>> w3lib.html.replace_entities(b'Price: £100') 'Price: \\xa3100' >>> print(w3lib.html.replace_entities(b'Price: £100')) Price: £100 >>> """ def convert_entity(m: Match[str]) -> str: groups = m.groupdict() number = None if groups.get("dec"): number = int(groups["dec"], 10) elif groups.get("hex"): number = int(groups["hex"], 16) elif groups.get("named"): entity_name = groups["named"] if entity_name.lower() in keep: return m.group(0) number = name2codepoint.get(entity_name) or name2codepoint.get( entity_name.lower() ) if number is not None: # Numeric character references in the 80-9F range are typically # interpreted by browsers as representing the characters mapped # to bytes 80-9F in the Windows-1252 encoding. 
For more info # see: http://en.wikipedia.org/wiki/Character_encodings_in_HTML try: if 0x80 <= number <= 0x9F: return bytes((number,)).decode("cp1252") return chr(number) except (ValueError, OverflowError): pass return "" if remove_illegal and groups.get("semicolon") else m.group(0) return _ent_re.sub(convert_entity, to_unicode(text, encoding)) def has_entities(text: StrOrBytes, encoding: str | None = None) -> bool: return bool(_ent_re.search(to_unicode(text, encoding))) def replace_tags(text: StrOrBytes, token: str = "", encoding: str | None = None) -> str: """Replace all markup tags found in the given `text` by the given token. By default `token` is an empty string so it just removes all tags. `text` can be a unicode string or a regular string encoded as `encoding` (or ``'utf-8'`` if `encoding` is not given.) Always returns a unicode string. Examples: >>> import w3lib.html >>> w3lib.html.replace_tags('This text contains some tag') 'This text contains some tag' >>> w3lib.html.replace_tags('
<p>Je ne parle pas <b>fran\\xe7ais</b></p>
', ' -- ', 'latin-1') ' -- Je ne parle pas -- fran\\xe7ais -- -- ' >>> """ return _tag_re.sub(token, to_unicode(text, encoding)) _REMOVECOMMENTS_RE = re.compile("|$)", re.DOTALL) def remove_comments(text: StrOrBytes, encoding: str | None = None) -> str: """Remove HTML Comments. >>> import w3lib.html >>> w3lib.html.remove_comments(b"test whatever") 'test whatever' >>> """ utext = to_unicode(text, encoding) return _REMOVECOMMENTS_RE.sub("", utext) def remove_tags( text: StrOrBytes, which_ones: Iterable[str] = (), keep: Iterable[str] = (), encoding: str | None = None, ) -> str: """Remove HTML Tags only. `which_ones` and `keep` are both tuples, there are four cases: ============== ============= ========================================== ``which_ones`` ``keep`` what it does ============== ============= ========================================== **not empty** empty remove all tags in ``which_ones`` empty **not empty** remove all tags except the ones in ``keep`` empty empty remove all tags **not empty** **not empty** not allowed ============== ============= ========================================== Remove all tags: >>> import w3lib.html >>> doc = '
<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>
' >>> w3lib.html.remove_tags(doc) 'This is a link: example' >>> Keep only some tags: >>> w3lib.html.remove_tags(doc, keep=('div',)) '
<div>This is a link: example</div>
' >>> Remove only specific tags: >>> w3lib.html.remove_tags(doc, which_ones=('a','b')) '
<div><p>This is a link: example</p></div>
' >>> You can't remove some and keep some: >>> w3lib.html.remove_tags(doc, which_ones=('a',), keep=('p',)) Traceback (most recent call last): ... ValueError: Cannot use both which_ones and keep >>> """ if which_ones and keep: raise ValueError("Cannot use both which_ones and keep") which_ones = {tag.lower() for tag in which_ones} keep = {tag.lower() for tag in keep} def will_remove(tag: str) -> bool: tag = tag.lower() if which_ones: return tag in which_ones return tag not in keep def remove_tag(m: Match[str]) -> str: tag = m.group(1) return "" if will_remove(tag) else m.group(0) regex = "/]+).*?>" retags = re.compile(regex, re.DOTALL | re.IGNORECASE) return retags.sub(remove_tag, to_unicode(text, encoding)) def remove_tags_with_content( text: StrOrBytes, which_ones: Iterable[str] = (), encoding: str | None = None ) -> str: """Remove tags and their content. `which_ones` is a tuple of which tags to remove including their content. If is empty, returns the string unmodified. >>> import w3lib.html >>> doc = '
<div><p><b>This is a link:</b> <a href="http://www.example.com">example</a></p></div>
' >>> w3lib.html.remove_tags_with_content(doc, which_ones=('b',)) '' >>> """ utext = to_unicode(text, encoding) if which_ones: tags = "|".join([rf"<{tag}\b.*?|<{tag}\s*/>" for tag in which_ones]) retags = re.compile(tags, re.DOTALL | re.IGNORECASE) utext = retags.sub("", utext) return utext def replace_escape_chars( text: StrOrBytes, which_ones: Iterable[str] = ("\n", "\t", "\r"), replace_by: StrOrBytes = "", encoding: str | None = None, ) -> str: """Remove escape characters. `which_ones` is a tuple of which escape characters we want to remove. By default removes ``\\n``, ``\\t``, ``\\r``. `replace_by` is the string to replace the escape characters by. It defaults to ``''``, meaning the escape characters are removed. """ utext = to_unicode(text, encoding) for ec in which_ones: utext = utext.replace(ec, to_unicode(replace_by, encoding)) return utext def unquote_markup( text: StrOrBytes, keep: Iterable[str] = (), remove_illegal: bool = True, encoding: str | None = None, ) -> str: """ This function receives markup as a text (always a unicode string or a UTF-8 encoded string) and does the following: 1. removes entities (except the ones in `keep`) from any part of it that is not inside a CDATA 2. searches for CDATAs and extracts their text (if any) without modifying it. 3. removes the found CDATAs """ def _get_fragments(txt: str, pattern: Pattern[str]) -> Iterable[str | Match[str]]: offset = 0 for match in pattern.finditer(txt): match_s, match_e = match.span(1) yield txt[offset:match_s] yield match offset = match_e yield txt[offset:] utext = to_unicode(text, encoding) ret_text = "" for fragment in _get_fragments(utext, _cdata_re): if isinstance(fragment, str): # it's not a CDATA (so we try to remove its entities) ret_text += replace_entities( fragment, keep=keep, remove_illegal=remove_illegal ) else: # it's a CDATA (so we just extract its content) ret_text += fragment.group("cdata_d") return ret_text def get_base_url( text: StrOrBytes, baseurl: StrOrBytes = "", encoding: str = "utf-8" ) -> str: """Return the base url if declared in the given HTML `text`, relative to the given base url. If no base url is found, the given `baseurl` is returned. """ utext: str = remove_comments(text, encoding=encoding) if m := _baseurl_re.search(utext): return urljoin( safe_url_string(baseurl), safe_url_string(m.group(1), encoding=encoding) ) return safe_url_string(baseurl) def get_meta_refresh( text: StrOrBytes, baseurl: str = "", encoding: str = "utf-8", ignore_tags: Iterable[str] = ("script", "noscript"), ) -> tuple[None, None] | tuple[float, str]: """Return the http-equiv parameter of the HTML meta element from the given HTML text and return a tuple ``(interval, url)`` where interval is an integer containing the delay in seconds (or zero if not present) and url is a string with the absolute url to redirect. If no meta redirect is found, ``(None, None)`` is returned. """ try: utext = to_unicode(text, encoding) except UnicodeDecodeError: print(text) raise utext = remove_tags_with_content(utext, ignore_tags) utext = remove_comments(replace_entities(utext)) if m := _meta_refresh_re.search(utext) or _meta_refresh_re2.search(utext): interval = float(m.group("int")) url = safe_url_string(m.group("url").strip(" \"'"), encoding) url = urljoin(baseurl, url) return interval, url return None, None def strip_html5_whitespace(text: str) -> str: r""" Strip all leading and trailing space characters (as defined in https://www.w3.org/TR/html5/infrastructure.html#space-character). Such stripping is useful e.g. 
for processing HTML element attributes which contain URLs, like ``href``, ``src`` or form ``action`` - HTML5 standard defines them as "valid URL potentially surrounded by spaces" or "valid non-empty URL potentially surrounded by spaces". >>> strip_html5_whitespace(' hello\n') 'hello' """ return text.strip(HTML5_WHITESPACE) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/http.py0000644000175100001660000000662614745713200014526 0ustar00runnerdockerfrom __future__ import annotations from base64 import b64encode from collections.abc import Mapping, MutableMapping, Sequence from typing import Any, Union, overload from w3lib._types import StrOrBytes from w3lib.util import to_bytes, to_unicode HeadersDictInput = Mapping[bytes, Union[Any, Sequence[bytes]]] HeadersDictOutput = MutableMapping[bytes, list[bytes]] @overload def headers_raw_to_dict(headers_raw: bytes) -> HeadersDictOutput: ... @overload def headers_raw_to_dict(headers_raw: None) -> None: ... def headers_raw_to_dict(headers_raw: bytes | None) -> HeadersDictOutput | None: r""" Convert raw headers (single multi-line bytestring) to a dictionary. For example: >>> import w3lib.http >>> w3lib.http.headers_raw_to_dict(b"Content-type: text/html\n\rAccept: gzip\n\n") # doctest: +SKIP {'Content-type': ['text/html'], 'Accept': ['gzip']} Incorrect input: >>> w3lib.http.headers_raw_to_dict(b"Content-typt gzip\n\n") {} >>> Argument is ``None`` (return ``None``): >>> w3lib.http.headers_raw_to_dict(None) >>> """ if headers_raw is None: return None headers = headers_raw.splitlines() headers_tuples = [header.split(b":", 1) for header in headers] result_dict: HeadersDictOutput = {} for header_item in headers_tuples: if not len(header_item) == 2: continue item_key = header_item[0].strip() item_value = header_item[1].strip() if item_key in result_dict: result_dict[item_key].append(item_value) else: result_dict[item_key] = [item_value] return result_dict @overload def headers_dict_to_raw(headers_dict: HeadersDictInput) -> bytes: ... @overload def headers_dict_to_raw(headers_dict: None) -> None: ... def headers_dict_to_raw(headers_dict: HeadersDictInput | None) -> bytes | None: r""" Returns a raw HTTP headers representation of headers For example: >>> import w3lib.http >>> w3lib.http.headers_dict_to_raw({b'Content-type': b'text/html', b'Accept': b'gzip'}) # doctest: +SKIP 'Content-type: text/html\\r\\nAccept: gzip' >>> Note that keys and values must be bytes. Argument is ``None`` (returns ``None``): >>> w3lib.http.headers_dict_to_raw(None) >>> """ if headers_dict is None: return None raw_lines = [] for key, value in headers_dict.items(): if isinstance(value, bytes): raw_lines.append(b": ".join([key, value])) elif isinstance(value, (list, tuple)): for v in value: raw_lines.append(b": ".join([key, v])) return b"\r\n".join(raw_lines) def basic_auth_header( username: StrOrBytes, password: StrOrBytes, encoding: str = "ISO-8859-1" ) -> bytes: """ Return an `Authorization` header field value for `HTTP Basic Access Authentication (RFC 2617)`_ >>> import w3lib.http >>> w3lib.http.basic_auth_header('someuser', 'somepass') 'Basic c29tZXVzZXI6c29tZXBhc3M=' .. _HTTP Basic Access Authentication (RFC 2617): http://www.ietf.org/rfc/rfc2617.txt """ auth = f"{to_unicode(username)}:{to_unicode(password)}" # XXX: RFC 2617 doesn't define encoding, but ISO-8859-1 # seems to be the most widely used encoding here. 
See also: # http://greenbytes.de/tech/webdav/draft-ietf-httpauth-basicauth-enc-latest.html return b"Basic " + b64encode(to_bytes(auth, encoding=encoding)) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/py.typed0000644000175100001660000000000014745713200014650 0ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/url.py0000644000175100001660000005726214745713200014353 0ustar00runnerdocker""" This module contains general purpose URL functions not found in the standard library. """ from __future__ import annotations import base64 import codecs import os import posixpath import re import string from collections.abc import Sequence from typing import Callable, NamedTuple, cast, overload from urllib.parse import _coerce_args # type: ignore from urllib.parse import ( ParseResult, parse_qs, parse_qsl, quote, unquote, unquote_to_bytes, urldefrag, urlencode, urlparse, urlsplit, urlunparse, urlunsplit, ) from urllib.request import pathname2url, url2pathname from ._infra import _ASCII_TAB_OR_NEWLINE, _C0_CONTROL_OR_SPACE from ._types import AnyUnicodeError, StrOrBytes from ._url import _SPECIAL_SCHEMES from .util import to_unicode # error handling function for bytes-to-Unicode decoding errors with URLs def _quote_byte(error: UnicodeError) -> tuple[str, int]: error = cast(AnyUnicodeError, error) return (to_unicode(quote(error.object[error.start : error.end])), error.end) codecs.register_error("percentencode", _quote_byte) # constants from RFC 3986, Section 2.2 and 2.3 RFC3986_GEN_DELIMS = b":/?#[]@" RFC3986_SUB_DELIMS = b"!$&'()*+,;=" RFC3986_RESERVED = RFC3986_GEN_DELIMS + RFC3986_SUB_DELIMS RFC3986_UNRESERVED = (string.ascii_letters + string.digits + "-._~").encode("ascii") EXTRA_SAFE_CHARS = b"|" # see https://github.com/scrapy/w3lib/pull/25 RFC3986_USERINFO_SAFE_CHARS = RFC3986_UNRESERVED + RFC3986_SUB_DELIMS + b":" _safe_chars = RFC3986_RESERVED + RFC3986_UNRESERVED + EXTRA_SAFE_CHARS + b"%" _path_safe_chars = _safe_chars.replace(b"#", b"") # Characters that are safe in all of: # # - RFC 2396 + RFC 2732, as interpreted by Java 8’s java.net.URI class # - RFC 3986 # - The URL living standard # # NOTE: % is currently excluded from these lists of characters, due to # limitations of the current safe_url_string implementation, but it should also # be escaped as %25 when it is not already being used as part of an escape # character. _USERINFO_SAFEST_CHARS = RFC3986_USERINFO_SAFE_CHARS.translate(None, delete=b":;=") _PATH_SAFEST_CHARS = _safe_chars.translate(None, delete=b"#[]|") _QUERY_SAFEST_CHARS = _PATH_SAFEST_CHARS _SPECIAL_QUERY_SAFEST_CHARS = _PATH_SAFEST_CHARS.translate(None, delete=b"'") _FRAGMENT_SAFEST_CHARS = _PATH_SAFEST_CHARS _ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE = { ord(char): None for char in _ASCII_TAB_OR_NEWLINE } def _strip(url: str) -> str: return url.strip(_C0_CONTROL_OR_SPACE).translate( _ASCII_TAB_OR_NEWLINE_TRANSLATION_TABLE ) def safe_url_string( # pylint: disable=too-many-locals url: StrOrBytes, encoding: str = "utf8", path_encoding: str = "utf8", quote_path: bool = True, ) -> str: """Return a URL equivalent to *url* that a wide range of web browsers and web servers consider valid. *url* is parsed according to the rules of the `URL living standard`_, and during serialization additional characters are percent-encoded to make the URL valid by additional URL standards. .. 
_URL living standard: https://url.spec.whatwg.org/ The returned URL should be valid by *all* of the following URL standards known to be enforced by modern-day web browsers and web servers: - `URL living standard`_ - `RFC 3986`_ - `RFC 2396`_ and `RFC 2732`_, as interpreted by `Java 8’s java.net.URI class`_. .. _Java 8’s java.net.URI class: https://docs.oracle.com/javase/8/docs/api/java/net/URI.html .. _RFC 2396: https://www.ietf.org/rfc/rfc2396.txt .. _RFC 2732: https://www.ietf.org/rfc/rfc2732.txt .. _RFC 3986: https://www.ietf.org/rfc/rfc3986.txt If a bytes URL is given, it is first converted to `str` using the given encoding (which defaults to 'utf-8'). If quote_path is True (default), path_encoding ('utf-8' by default) is used to encode URL path component which is then quoted. Otherwise, if quote_path is False, path component is not encoded or quoted. Given encoding is used for query string or form data. When passing an encoding, you should use the encoding of the original page (the page from which the URL was extracted from). Calling this function on an already "safe" URL will return the URL unmodified. """ # urlsplit() chokes on bytes input with non-ASCII chars, # so let's decode (to Unicode) using page encoding: # - it is assumed that a raw bytes input comes from a document # encoded with the supplied encoding (or UTF8 by default) # - if the supplied (or default) encoding chokes, # percent-encode offending bytes decoded = to_unicode(url, encoding=encoding, errors="percentencode") parts = urlsplit(_strip(decoded)) username, password, hostname, port = ( parts.username, parts.password, parts.hostname, parts.port, ) netloc_bytes = b"" if username is not None or password is not None: if username is not None: safe_username = quote(unquote(username), _USERINFO_SAFEST_CHARS) netloc_bytes += safe_username.encode(encoding) if password is not None: netloc_bytes += b":" safe_password = quote(unquote(password), _USERINFO_SAFEST_CHARS) netloc_bytes += safe_password.encode(encoding) netloc_bytes += b"@" if hostname is not None: try: netloc_bytes += hostname.encode("idna") except UnicodeError: # IDNA encoding can fail for too long labels (>63 characters) or # missing labels (e.g. http://.example.com) netloc_bytes += hostname.encode(encoding) if port is not None: netloc_bytes += b":" netloc_bytes += str(port).encode(encoding) netloc = netloc_bytes.decode() # default encoding for path component SHOULD be UTF-8 if quote_path: path = quote(parts.path.encode(path_encoding), _PATH_SAFEST_CHARS) else: path = parts.path if parts.scheme in _SPECIAL_SCHEMES: query = quote(parts.query.encode(encoding), _SPECIAL_QUERY_SAFEST_CHARS) else: query = quote(parts.query.encode(encoding), _QUERY_SAFEST_CHARS) return urlunsplit( ( parts.scheme, netloc, path, query, quote(parts.fragment.encode(encoding), _FRAGMENT_SAFEST_CHARS), ) ) _parent_dirs = re.compile(r"/?(\.\./)+") def safe_download_url( url: StrOrBytes, encoding: str = "utf8", path_encoding: str = "utf8" ) -> str: """Make a url for download. This will call safe_url_string and then strip the fragment, if one exists. The path will be normalised. If the path is outside the document root, it will be changed to be within the document root. 
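For example, a URL such as ``http://www.example.com/a/../b/./c.html#frag`` should come back as ``http://www.example.com/b/c.html``: the fragment is dropped and the path is normalised.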
""" safe_url = safe_url_string(url, encoding, path_encoding) scheme, netloc, path, query, _ = urlsplit(safe_url) if path: path = _parent_dirs.sub("", posixpath.normpath(path)) if safe_url.endswith("/") and not path.endswith("/"): path += "/" else: path = "/" return urlunsplit((scheme, netloc, path, query, "")) def is_url(text: str) -> bool: return text.partition("://")[0] in ("file", "http", "https") @overload def url_query_parameter( url: StrOrBytes, parameter: str, default: None = None, keep_blank_values: bool | int = 0, ) -> str | None: ... @overload def url_query_parameter( url: StrOrBytes, parameter: str, default: str, keep_blank_values: bool | int = 0, ) -> str: ... def url_query_parameter( url: StrOrBytes, parameter: str, default: str | None = None, keep_blank_values: bool | int = 0, ) -> str | None: """Return the value of a url parameter, given the url and parameter name General case: >>> import w3lib.url >>> w3lib.url.url_query_parameter("product.html?id=200&foo=bar", "id") '200' >>> Return a default value if the parameter is not found: >>> w3lib.url.url_query_parameter("product.html?id=200&foo=bar", "notthere", "mydefault") 'mydefault' >>> Returns None if `keep_blank_values` not set or 0 (default): >>> w3lib.url.url_query_parameter("product.html?id=", "id") >>> Returns an empty string if `keep_blank_values` set to 1: >>> w3lib.url.url_query_parameter("product.html?id=", "id", keep_blank_values=1) '' >>> """ queryparams = parse_qs( urlsplit(str(url))[3], keep_blank_values=bool(keep_blank_values) ) if parameter in queryparams: return queryparams[parameter][0] return default def url_query_cleaner( url: StrOrBytes, parameterlist: StrOrBytes | Sequence[StrOrBytes] = (), sep: str = "&", kvsep: str = "=", remove: bool = False, unique: bool = True, keep_fragments: bool = False, ) -> str: """Clean URL arguments leaving only those passed in the parameterlist keeping order >>> import w3lib.url >>> w3lib.url.url_query_cleaner("product.html?id=200&foo=bar&name=wired", ('id',)) 'product.html?id=200' >>> w3lib.url.url_query_cleaner("product.html?id=200&foo=bar&name=wired", ['id', 'name']) 'product.html?id=200&name=wired' >>> If `unique` is ``False``, do not remove duplicated keys >>> w3lib.url.url_query_cleaner("product.html?d=1&e=b&d=2&d=3&other=other", ['d'], unique=False) 'product.html?d=1&d=2&d=3' >>> If `remove` is ``True``, leave only those **not in parameterlist**. >>> w3lib.url.url_query_cleaner("product.html?id=200&foo=bar&name=wired", ['id'], remove=True) 'product.html?foo=bar&name=wired' >>> w3lib.url.url_query_cleaner("product.html?id=2&foo=bar&name=wired", ['id', 'foo'], remove=True) 'product.html?name=wired' >>> By default, URL fragments are removed. If you need to preserve fragments, pass the ``keep_fragments`` argument as ``True``. 
>>> w3lib.url.url_query_cleaner('http://domain.tld/?bla=123#123123', ['bla'], remove=True, keep_fragments=True) 'http://domain.tld/#123123' """ if isinstance(parameterlist, (str, bytes)): parameterlist = [parameterlist] url, fragment = urldefrag(url) url = cast(str, url) fragment = cast(str, fragment) base, _, query = url.partition("?") seen = set() querylist = [] for ksv in query.split(sep): if not ksv: continue k, _, _ = ksv.partition(kvsep) if unique and k in seen: continue if remove and k in parameterlist: continue if not remove and k not in parameterlist: continue querylist.append(ksv) seen.add(k) url = "?".join([base, sep.join(querylist)]) if querylist else base if keep_fragments and fragment: url += "#" + fragment return url def _add_or_replace_parameters(url: str, params: dict[str, str]) -> str: parsed = urlsplit(url) current_args = parse_qsl(parsed.query, keep_blank_values=True) new_args = [] seen_params = set() for name, value in current_args: if name not in params: new_args.append((name, value)) elif name not in seen_params: new_args.append((name, params[name])) seen_params.add(name) not_modified_args = [ (name, value) for name, value in params.items() if name not in seen_params ] new_args += not_modified_args query = urlencode(new_args) return urlunsplit(parsed._replace(query=query)) def add_or_replace_parameter(url: str, name: str, new_value: str) -> str: """Add or remove a parameter to a given url >>> import w3lib.url >>> w3lib.url.add_or_replace_parameter('http://www.example.com/index.php', 'arg', 'v') 'http://www.example.com/index.php?arg=v' >>> w3lib.url.add_or_replace_parameter('http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3', 'arg4', 'v4') 'http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3&arg4=v4' >>> w3lib.url.add_or_replace_parameter('http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3', 'arg3', 'v3new') 'http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3new' >>> """ return _add_or_replace_parameters(url, {name: new_value}) def add_or_replace_parameters(url: str, new_parameters: dict[str, str]) -> str: """Add or remove a parameters to a given url >>> import w3lib.url >>> w3lib.url.add_or_replace_parameters('http://www.example.com/index.php', {'arg': 'v'}) 'http://www.example.com/index.php?arg=v' >>> args = {'arg4': 'v4', 'arg3': 'v3new'} >>> w3lib.url.add_or_replace_parameters('http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3', args) 'http://www.example.com/index.php?arg1=v1&arg2=v2&arg3=v3new&arg4=v4' >>> """ return _add_or_replace_parameters(url, new_parameters) def path_to_file_uri(path: str) -> str: """Convert local filesystem path to legal File URIs as described in: http://en.wikipedia.org/wiki/File_URI_scheme """ x = pathname2url(os.path.abspath(path)) return f"file:///{x.lstrip('/')}" def file_uri_to_path(uri: str) -> str: """Convert File URI to local filesystem path according to: http://en.wikipedia.org/wiki/File_URI_scheme """ uri_path = urlparse(uri).path return url2pathname(uri_path) def any_to_uri(uri_or_path: str) -> str: """If given a path name, return its File URI, otherwise return it unmodified """ if os.path.splitdrive(uri_or_path)[0]: return path_to_file_uri(uri_or_path) u = urlparse(uri_or_path) return uri_or_path if u.scheme else path_to_file_uri(uri_or_path) # ASCII characters. _char = set(map(chr, range(127))) # RFC 2045 token. _token = r"[{}]+".format( re.escape( "".join( _char - # Control characters. set(map(chr, range(0, 32))) - # tspecials and space. 
set('()<>@,;:\\"/[]?= ') ) ) ) # RFC 822 quoted-string, without surrounding quotation marks. _quoted_string = r"(?:[{}]|(?:\\[{}]))*".format( re.escape("".join(_char - {'"', "\\", "\r"})), re.escape("".join(_char)) ) # Encode the regular expression strings to make them into bytes, as Python 3 # bytes have no format() method, but bytes must be passed to re.compile() in # order to make a pattern object that can be used to match on bytes. # RFC 2397 mediatype. _mediatype_pattern = re.compile(r"{token}/{token}".format(token=_token).encode()) _mediatype_parameter_pattern = re.compile( r';({token})=(?:({token})|"({quoted})")'.format( token=_token, quoted=_quoted_string ).encode() ) class ParseDataURIResult(NamedTuple): """Named tuple returned by :func:`parse_data_uri`.""" #: MIME type type and subtype, separated by / (e.g. ``"text/plain"``). media_type: str #: MIME type parameters (e.g. ``{"charset": "US-ASCII"}``). media_type_parameters: dict[str, str] #: Data, decoded if it was encoded in base64 format. data: bytes def parse_data_uri(uri: StrOrBytes) -> ParseDataURIResult: """Parse a data: URI into :class:`ParseDataURIResult`.""" if not isinstance(uri, bytes): uri = safe_url_string(uri).encode("ascii") try: scheme, uri = uri.split(b":", 1) except ValueError: raise ValueError("invalid URI") if scheme.lower() != b"data": raise ValueError("not a data URI") # RFC 3986 section 2.1 allows percent encoding to escape characters that # would be interpreted as delimiters, implying that actual delimiters # should not be percent-encoded. # Decoding before parsing will allow malformed URIs with percent-encoded # delimiters, but it makes parsing easier and should not affect # well-formed URIs, as the delimiters used in this URI scheme are not # allowed, percent-encoded or not, in tokens. uri = unquote_to_bytes(uri) media_type = "text/plain" media_type_params = {} m = _mediatype_pattern.match(uri) if m: media_type = m.group().decode() uri = uri[m.end() :] else: media_type_params["charset"] = "US-ASCII" while True: m = _mediatype_parameter_pattern.match(uri) if m: attribute, value, value_quoted = m.groups() if value_quoted: value = re.sub(rb"\\(.)", rb"\1", value_quoted) media_type_params[attribute.decode()] = value.decode() uri = uri[m.end() :] else: break try: is_base64, data = uri.split(b",", 1) except ValueError: raise ValueError("invalid data URI") if is_base64: if is_base64 != b";base64": raise ValueError("invalid data URI") data = base64.b64decode(data) return ParseDataURIResult(media_type, media_type_params, data) __all__ = [ "add_or_replace_parameter", "add_or_replace_parameters", "any_to_uri", "canonicalize_url", "file_uri_to_path", "is_url", "parse_data_uri", "path_to_file_uri", "safe_download_url", "safe_url_string", "url_query_cleaner", "url_query_parameter", ] def _safe_ParseResult( parts: ParseResult, encoding: str = "utf8", path_encoding: str = "utf8" ) -> tuple[str, str, str, str, str, str]: # IDNA encoding can fail for too long labels (>63 characters) # or missing labels (e.g. 
http://.example.com) try: netloc = parts.netloc.encode("idna").decode() except UnicodeError: netloc = parts.netloc return ( parts.scheme, netloc, quote(parts.path.encode(path_encoding), _path_safe_chars), quote(parts.params.encode(path_encoding), _safe_chars), quote(parts.query.encode(encoding), _safe_chars), quote(parts.fragment.encode(encoding), _safe_chars), ) def canonicalize_url( url: StrOrBytes | ParseResult, keep_blank_values: bool = True, keep_fragments: bool = False, encoding: str | None = None, ) -> str: r"""Canonicalize the given url by applying the following procedures: - make the URL safe - sort query arguments, first by key, then by value - normalize all spaces (in query arguments) '+' (plus symbol) - normalize percent encodings case (%2f -> %2F) - remove query arguments with blank values (unless `keep_blank_values` is True) - remove fragments (unless `keep_fragments` is True) The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3). >>> import w3lib.url >>> >>> # sorting query arguments >>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50') 'http://www.example.com/do?a=50&b=2&b=5&c=3' >>> >>> # UTF-8 conversion + percent-encoding of non-ASCII characters >>> w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9') 'http://www.example.com/r%C3%A9sum%C3%A9' >>> For more examples, see the tests in `tests/test_url.py`. """ # If supplied `encoding` is not compatible with all characters in `url`, # fallback to UTF-8 as safety net. # UTF-8 can handle all Unicode characters, # so we should be covered regarding URL normalization, # if not for proper URL expected by remote website. if isinstance(url, str): url = _strip(url) try: scheme, netloc, path, params, query, fragment = _safe_ParseResult( parse_url(url), encoding=encoding or "utf8" ) except UnicodeEncodeError: scheme, netloc, path, params, query, fragment = _safe_ParseResult( parse_url(url), encoding="utf8" ) # 1. decode query-string as UTF-8 (or keep raw bytes), # sort values, # and percent-encode them back # Python's urllib.parse.parse_qsl does not work as wanted # for percent-encoded characters that do not match passed encoding, # they get lost. # # e.g., 'q=b%a3' becomes [('q', 'b\ufffd')] # (ie. with 'REPLACEMENT CHARACTER' (U+FFFD), # instead of \xa3 that you get with Python2's parse_qsl) # # what we want here is to keep raw bytes, and percent encode them # so as to preserve whatever encoding what originally used. # # See https://tools.ietf.org/html/rfc3987#section-6.4: # # For example, it is possible to have a URI reference of # "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the # document name is encoded in iso-8859-1 based on server settings, but # where the fragment identifier is encoded in UTF-8 according to # [XPointer]. The IRI corresponding to the above URI would be (in XML # notation) # "http://www.example.org/r%E9sum%E9.xml#résumé". # Similar considerations apply to query parts. The functionality of # IRIs (namely, to be able to include non-ASCII characters) can only be # used if the query part is encoded in UTF-8. keyvals = parse_qsl_to_bytes(query, keep_blank_values) keyvals.sort() query = urlencode(keyvals) # 2. 
decode percent-encoded sequences in path as UTF-8 (or keep raw bytes) # and percent-encode path again (this normalizes to upper-case %XX) uqp = _unquotepath(path) path = quote(uqp, _path_safe_chars) or "/" fragment = "" if not keep_fragments else fragment # Apply lowercase to the domain, but not to the userinfo. netloc_parts = netloc.split("@") netloc_parts[-1] = netloc_parts[-1].lower().rstrip(":") netloc = "@".join(netloc_parts) # every part should be safe already return urlunparse((scheme, netloc, path, params, query, fragment)) def _unquotepath(path: str) -> bytes: for reserved in ("2f", "2F", "3f", "3F"): path = path.replace("%" + reserved, "%25" + reserved.upper()) # standard lib's unquote() does not work for non-UTF-8 # percent-escaped characters, they get lost. # e.g., '%a3' becomes 'REPLACEMENT CHARACTER' (U+FFFD) # # unquote_to_bytes() returns raw bytes instead return unquote_to_bytes(path) def parse_url( url: StrOrBytes | ParseResult, encoding: str | None = None ) -> ParseResult: """Return urlparsed url from the given argument (which could be an already parsed url) """ if isinstance(url, ParseResult): return url return urlparse(to_unicode(url, encoding)) def parse_qsl_to_bytes( qs: str, keep_blank_values: bool = False ) -> list[tuple[bytes, bytes]]: """Parse a query given as a string argument. Data are returned as a list of name, value pairs as bytes. Arguments: qs: percent-encoded query string to be parsed keep_blank_values: flag indicating whether blank values in percent-encoded queries should be treated as blank strings. A true value indicates that blanks should be retained as blank strings. The default false value indicates that blank values are to be ignored and treated as if they were not included. """ # This code is the same as Python3's parse_qsl() # (at https://hg.python.org/cpython/rev/c38ac7ab8d9a) # except for the unquote(s, encoding, errors) calls replaced # with unquote_to_bytes(s) coerce_args = cast(Callable[..., tuple[str, Callable[..., bytes]]], _coerce_args) qs, _coerce_result = coerce_args(qs) pairs = [s2 for s1 in qs.split("&") for s2 in s1.split(";")] r = [] for name_value in pairs: if not name_value: continue nv = name_value.split("=", 1) if len(nv) != 2: # Handle case of a control-name with no equal sign if keep_blank_values: nv.append("") else: continue if len(nv[1]) or keep_blank_values: name: StrOrBytes = nv[0].replace("+", " ") name = unquote_to_bytes(name) name = _coerce_result(name) value: StrOrBytes = nv[1].replace("+", " ") value = unquote_to_bytes(value) value = _coerce_result(value) r.append((name, value)) return r ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987712.0 w3lib-2.3.1/w3lib/util.py0000644000175100001660000000212614745713200014513 0ustar00runnerdockerfrom __future__ import annotations from w3lib._types import StrOrBytes def to_unicode( text: StrOrBytes, encoding: str | None = None, errors: str = "strict" ) -> str: """Return the unicode representation of a bytes object `text`. If `text` is already an unicode object, return it as-is.""" if isinstance(text, str): return text if not isinstance(text, (bytes, str)): raise TypeError( f"to_unicode must receive bytes or str, got {type(text).__name__}" ) if encoding is None: encoding = "utf-8" return text.decode(encoding, errors) def to_bytes( text: StrOrBytes, encoding: str | None = None, errors: str = "strict" ) -> bytes: """Return the binary representation of `text`. 
If `text` is already a bytes object, return it as-is.""" if isinstance(text, bytes): return text if not isinstance(text, str): raise TypeError( f"to_bytes must receive str or bytes, got {type(text).__name__}" ) if encoding is None: encoding = "utf-8" return text.encode(encoding, errors) ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1737987726.7860878 w3lib-2.3.1/w3lib.egg-info/0000755000175100001660000000000014745713217014665 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987726.0 w3lib-2.3.1/w3lib.egg-info/PKG-INFO0000644000175100001660000000441414745713216015764 0ustar00runnerdockerMetadata-Version: 2.2 Name: w3lib Version: 2.3.1 Summary: Library of web-related functions Home-page: https://github.com/scrapy/w3lib Author: Scrapy project Author-email: info@scrapy.org License: BSD Project-URL: Documentation, https://w3lib.readthedocs.io/en/latest/ Project-URL: Source Code, https://github.com/scrapy/w3lib Project-URL: Issue Tracker, https://github.com/scrapy/w3lib/issues Platform: Any Classifier: Development Status :: 5 - Production/Stable Classifier: License :: OSI Approved :: BSD License Classifier: Operating System :: OS Independent Classifier: Programming Language :: Python Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: Python :: 3.11 Classifier: Programming Language :: Python :: 3.12 Classifier: Programming Language :: Python :: 3.13 Classifier: Programming Language :: Python :: Implementation :: CPython Classifier: Programming Language :: Python :: Implementation :: PyPy Classifier: Topic :: Internet :: WWW/HTTP Requires-Python: >=3.9 Description-Content-Type: text/x-rst License-File: LICENSE Dynamic: author Dynamic: author-email Dynamic: classifier Dynamic: description Dynamic: description-content-type Dynamic: home-page Dynamic: license Dynamic: platform Dynamic: project-url Dynamic: requires-python Dynamic: summary ===== w3lib ===== .. image:: https://github.com/scrapy/w3lib/actions/workflows/tests.yml/badge.svg :target: https://github.com/scrapy/w3lib/actions .. image:: https://img.shields.io/codecov/c/github/scrapy/w3lib/master.svg :target: http://codecov.io/github/scrapy/w3lib?branch=master :alt: Coverage report Overview ======== This is a Python library of web-related functions, such as: * remove comments, or tags from HTML snippets * extract base url from HTML snippets * translate entites on HTML strings * convert raw HTTP headers to dicts and vice-versa * construct HTTP auth header * converting HTML pages to unicode * sanitize urls (like browsers do) * extract arguments from urls Requirements ============ Python 3.9+ Install ======= ``pip install w3lib`` Documentation ============= See http://w3lib.readthedocs.org/ License ======= The w3lib library is licensed under the BSD license. 
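The overview above maps onto a handful of importable helpers. A minimal usage sketch (the outputs in the comments are indicative, assuming the documented behaviour of these functions)::

    from w3lib.html import remove_tags, replace_entities
    from w3lib.http import basic_auth_header, headers_raw_to_dict
    from w3lib.url import canonicalize_url, safe_url_string

    # Sanitize and canonicalize URLs the way browsers do.
    safe_url_string("http://www.example.com/résumé?q=résumé")
    # 'http://www.example.com/r%C3%A9sum%C3%A9?q=r%C3%A9sum%C3%A9'
    canonicalize_url("http://www.example.com/do?c=3&b=5&a=50")
    # 'http://www.example.com/do?a=50&b=5&c=3'

    # Strip markup and decode entities in HTML snippets.
    remove_tags("<p>Hello <b>world</b></p>")   # 'Hello world'
    replace_entities("Price: &pound;100")      # 'Price: £100'

    # HTTP header helpers.
    headers_raw_to_dict(b"Content-Type: text/html\r\nAccept: gzip")
    # {b'Content-Type': [b'text/html'], b'Accept': [b'gzip']}
    basic_auth_header("someuser", "somepass")
    # b'Basic c29tZXVzZXI6c29tZXBhc3M='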
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987726.0 w3lib-2.3.1/w3lib.egg-info/SOURCES.txt0000644000175100001660000000104214745713216016545 0ustar00runnerdockerLICENSE MANIFEST.in NEWS README.rst pytest.ini setup.py tox.ini docs/Makefile docs/conf.py docs/index.rst docs/make.bat docs/w3lib.rst tests/__init__.py tests/test_encoding.py tests/test_html.py tests/test_http.py tests/test_url.py tests/test_util.py w3lib/__init__.py w3lib/_infra.py w3lib/_types.py w3lib/_url.py w3lib/encoding.py w3lib/html.py w3lib/http.py w3lib/py.typed w3lib/url.py w3lib/util.py w3lib.egg-info/PKG-INFO w3lib.egg-info/SOURCES.txt w3lib.egg-info/dependency_links.txt w3lib.egg-info/not-zip-safe w3lib.egg-info/top_level.txt././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987726.0 w3lib-2.3.1/w3lib.egg-info/dependency_links.txt0000644000175100001660000000000114745713216020732 0ustar00runnerdocker ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987726.0 w3lib-2.3.1/w3lib.egg-info/not-zip-safe0000644000175100001660000000000114745713216017112 0ustar00runnerdocker ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1737987726.0 w3lib-2.3.1/w3lib.egg-info/top_level.txt0000644000175100001660000000000614745713216017412 0ustar00runnerdockerw3lib