././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7319238 MechanicalSoup-1.3.0/0000755000175100001720000000000014451070531014006 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/LICENSE0000644000175100001720000000205214451070517015016 0ustar00runnerdockerThe MIT License (MIT) Copyright (c) 2014 Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/MANIFEST.in0000644000175100001720000000022414451070517015546 0ustar00runnerdockerinclude LICENSE README.rst recursive-include tests *.py include examples/example*.py include requirements.txt tests/requirements.txt include docs/* ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7279239 MechanicalSoup-1.3.0/MechanicalSoup.egg-info/0000755000175100001720000000000014451070531020373 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498520.0 MechanicalSoup-1.3.0/MechanicalSoup.egg-info/PKG-INFO0000644000175100001720000001347714451070530021503 0ustar00runnerdockerMetadata-Version: 2.1 Name: MechanicalSoup Version: 1.3.0 Summary: A Python library for automating interaction with websites Home-page: https://mechanicalsoup.readthedocs.io/ License: MIT Project-URL: Source, https://github.com/MechanicalSoup/MechanicalSoup Classifier: License :: OSI Approved :: MIT License Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 Classifier: Programming Language :: Python :: 3.8 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: Python :: 3.11 Classifier: Programming Language :: Python :: 3 :: Only Requires-Python: >=3.6 License-File: LICENSE .. image:: https://raw.githubusercontent.com/MechanicalSoup/MechanicalSoup/main/assets/mechanical-soup-logo.png :alt: MechanicalSoup. A Python library for automating website interaction. Home page --------- https://mechanicalsoup.readthedocs.io/ Overview -------- A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript. MechanicalSoup was created by `M Hickford `__, who was a fond user of the `Mechanize `__ library. 
Unfortunately, Mechanize was `incompatible with Python 3 until 2019 `__ and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants `Requests `__ (for HTTP sessions) and `BeautifulSoup `__ (for document navigation). Since 2017 it is a project actively maintained by a small team including `@hemberger `__ and `@moy `__. |Gitter Chat| Installation ------------ |Latest Version| |Supported Versions| PyPy3 is also supported (and tested against). Download and install the latest released version from `PyPI `__:: pip install MechanicalSoup Download and install the development version from `GitHub `__:: pip install git+https://github.com/MechanicalSoup/MechanicalSoup Installing from source (installs the version in the current working directory):: python setup.py install (In all cases, add ``--user`` to the ``install`` command to install in the current user's home directory.) Documentation ------------- The full documentation is available on https://mechanicalsoup.readthedocs.io/. You may want to jump directly to the `automatically generated API documentation `__. Example ------- From `examples/expl_qwant.py `__, code to get the results from a Qwant search: .. code:: python """Example usage of MechanicalSoup to get the results from the Qwant search engine. """ import re import mechanicalsoup import html import urllib.parse # Connect to Qwant browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup') browser.open("https://lite.qwant.com/") # Fill-in the search form browser.select_form('#search-form') browser["q"] = "MechanicalSoup" browser.submit_selected() # Display the results for link in browser.page.select('.result a'): # Qwant shows redirection links, not the actual URL, so extract # the actual URL from the redirect link: href = link.attrs['href'] m = re.match(r"^/redirect/[^/]*/(.*)$", href) if m: href = urllib.parse.unquote(m.group(1)) print(link.text, '->', href) More examples are available in `examples/ `__. For an example with a more complex form (checkboxes, radio buttons and textareas), read `tests/test_browser.py `__ and `tests/test_form.py `__. Development ----------- |Build Status| |Coverage Status| |Documentation Status| |CII Best Practices| Instructions for building, testing and contributing to MechanicalSoup: see `CONTRIBUTING.rst `__. Common problems --------------- Read the `FAQ `__. .. |Latest Version| image:: https://img.shields.io/pypi/v/MechanicalSoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ .. |Supported Versions| image:: https://img.shields.io/pypi/pyversions/mechanicalsoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ .. |Build Status| image:: https://github.com/MechanicalSoup/MechanicalSoup/actions/workflows/python-package.yml/badge.svg?branch=main :target: https://github.com/MechanicalSoup/MechanicalSoup/actions/workflows/python-package.yml?query=branch%3Amain .. |Coverage Status| image:: https://codecov.io/gh/MechanicalSoup/MechanicalSoup/branch/main/graph/badge.svg :target: https://codecov.io/gh/MechanicalSoup/MechanicalSoup .. |Documentation Status| image:: https://readthedocs.org/projects/mechanicalsoup/badge/?version=latest :target: https://mechanicalsoup.readthedocs.io/en/latest/?badge=latest .. |CII Best Practices| image:: https://bestpractices.coreinfrastructure.org/projects/1334/badge :target: https://bestpractices.coreinfrastructure.org/projects/1334 .. 
|Gitter Chat| image:: https://badges.gitter.im/MechanicalSoup/MechanicalSoup.svg :target: https://gitter.im/MechanicalSoup/Lobby ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498520.0 MechanicalSoup-1.3.0/MechanicalSoup.egg-info/SOURCES.txt0000644000175100001720000000143214451070530022256 0ustar00runnerdockerLICENSE MANIFEST.in README.rst requirements.txt setup.cfg setup.py MechanicalSoup.egg-info/PKG-INFO MechanicalSoup.egg-info/SOURCES.txt MechanicalSoup.egg-info/dependency_links.txt MechanicalSoup.egg-info/requires.txt MechanicalSoup.egg-info/top_level.txt docs/ChangeLog.rst docs/Makefile docs/conf.py docs/external-resources.rst docs/faq.rst docs/index.rst docs/introduction.rst docs/make.bat docs/mechanicalsoup.rst docs/tutorial.rst examples/example.py examples/example_manual.py mechanicalsoup/__init__.py mechanicalsoup/__version__.py mechanicalsoup/browser.py mechanicalsoup/form.py mechanicalsoup/stateful_browser.py mechanicalsoup/utils.py tests/requirements.txt tests/setpath.py tests/test_browser.py tests/test_form.py tests/test_stateful_browser.py tests/test_utils.py tests/utils.py././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498520.0 MechanicalSoup-1.3.0/MechanicalSoup.egg-info/dependency_links.txt0000644000175100001720000000000114451070530024440 0ustar00runnerdocker ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498520.0 MechanicalSoup-1.3.0/MechanicalSoup.egg-info/requires.txt0000644000175100001720000000005214451070530022767 0ustar00runnerdockerrequests>=2.22.0 beautifulsoup4>=4.7 lxml ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498520.0 MechanicalSoup-1.3.0/MechanicalSoup.egg-info/top_level.txt0000644000175100001720000000001714451070530023122 0ustar00runnerdockermechanicalsoup ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7319238 MechanicalSoup-1.3.0/PKG-INFO0000644000175100001720000001347714451070531015117 0ustar00runnerdockerMetadata-Version: 2.1 Name: MechanicalSoup Version: 1.3.0 Summary: A Python library for automating interaction with websites Home-page: https://mechanicalsoup.readthedocs.io/ License: MIT Project-URL: Source, https://github.com/MechanicalSoup/MechanicalSoup Classifier: License :: OSI Approved :: MIT License Classifier: Programming Language :: Python :: 3 Classifier: Programming Language :: Python :: 3.6 Classifier: Programming Language :: Python :: 3.7 Classifier: Programming Language :: Python :: 3.8 Classifier: Programming Language :: Python :: 3.9 Classifier: Programming Language :: Python :: 3.10 Classifier: Programming Language :: Python :: 3.11 Classifier: Programming Language :: Python :: 3 :: Only Requires-Python: >=3.6 License-File: LICENSE .. image:: https://raw.githubusercontent.com/MechanicalSoup/MechanicalSoup/main/assets/mechanical-soup-logo.png :alt: MechanicalSoup. A Python library for automating website interaction. Home page --------- https://mechanicalsoup.readthedocs.io/ Overview -------- A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript. MechanicalSoup was created by `M Hickford `__, who was a fond user of the `Mechanize `__ library. Unfortunately, Mechanize was `incompatible with Python 3 until 2019 `__ and its development stalled for several years. 
MechanicalSoup provides a similar API, built on Python giants `Requests `__ (for HTTP sessions) and `BeautifulSoup `__ (for document navigation). Since 2017 it is a project actively maintained by a small team including `@hemberger `__ and `@moy `__. |Gitter Chat| Installation ------------ |Latest Version| |Supported Versions| PyPy3 is also supported (and tested against). Download and install the latest released version from `PyPI `__:: pip install MechanicalSoup Download and install the development version from `GitHub `__:: pip install git+https://github.com/MechanicalSoup/MechanicalSoup Installing from source (installs the version in the current working directory):: python setup.py install (In all cases, add ``--user`` to the ``install`` command to install in the current user's home directory.) Documentation ------------- The full documentation is available on https://mechanicalsoup.readthedocs.io/. You may want to jump directly to the `automatically generated API documentation `__. Example ------- From `examples/expl_qwant.py `__, code to get the results from a Qwant search: .. code:: python """Example usage of MechanicalSoup to get the results from the Qwant search engine. """ import re import mechanicalsoup import html import urllib.parse # Connect to Qwant browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup') browser.open("https://lite.qwant.com/") # Fill-in the search form browser.select_form('#search-form') browser["q"] = "MechanicalSoup" browser.submit_selected() # Display the results for link in browser.page.select('.result a'): # Qwant shows redirection links, not the actual URL, so extract # the actual URL from the redirect link: href = link.attrs['href'] m = re.match(r"^/redirect/[^/]*/(.*)$", href) if m: href = urllib.parse.unquote(m.group(1)) print(link.text, '->', href) More examples are available in `examples/ `__. For an example with a more complex form (checkboxes, radio buttons and textareas), read `tests/test_browser.py `__ and `tests/test_form.py `__. Development ----------- |Build Status| |Coverage Status| |Documentation Status| |CII Best Practices| Instructions for building, testing and contributing to MechanicalSoup: see `CONTRIBUTING.rst `__. Common problems --------------- Read the `FAQ `__. .. |Latest Version| image:: https://img.shields.io/pypi/v/MechanicalSoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ .. |Supported Versions| image:: https://img.shields.io/pypi/pyversions/mechanicalsoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ .. |Build Status| image:: https://github.com/MechanicalSoup/MechanicalSoup/actions/workflows/python-package.yml/badge.svg?branch=main :target: https://github.com/MechanicalSoup/MechanicalSoup/actions/workflows/python-package.yml?query=branch%3Amain .. |Coverage Status| image:: https://codecov.io/gh/MechanicalSoup/MechanicalSoup/branch/main/graph/badge.svg :target: https://codecov.io/gh/MechanicalSoup/MechanicalSoup .. |Documentation Status| image:: https://readthedocs.org/projects/mechanicalsoup/badge/?version=latest :target: https://mechanicalsoup.readthedocs.io/en/latest/?badge=latest .. |CII Best Practices| image:: https://bestpractices.coreinfrastructure.org/projects/1334/badge :target: https://bestpractices.coreinfrastructure.org/projects/1334 .. 
|Gitter Chat| image:: https://badges.gitter.im/MechanicalSoup/MechanicalSoup.svg :target: https://gitter.im/MechanicalSoup/Lobby ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/README.rst0000644000175100001720000001120014451070517015473 0ustar00runnerdocker.. image:: /assets/mechanical-soup-logo.png :alt: MechanicalSoup. A Python library for automating website interaction. Home page --------- https://mechanicalsoup.readthedocs.io/ Overview -------- A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do JavaScript. MechanicalSoup was created by `M Hickford `__, who was a fond user of the `Mechanize `__ library. Unfortunately, Mechanize was `incompatible with Python 3 until 2019 `__ and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants `Requests `__ (for HTTP sessions) and `BeautifulSoup `__ (for document navigation). Since 2017 it is a project actively maintained by a small team including `@hemberger `__ and `@moy `__. |Gitter Chat| Installation ------------ |Latest Version| |Supported Versions| PyPy3 is also supported (and tested against). Download and install the latest released version from `PyPI `__:: pip install MechanicalSoup Download and install the development version from `GitHub `__:: pip install git+https://github.com/MechanicalSoup/MechanicalSoup Installing from source (installs the version in the current working directory):: python setup.py install (In all cases, add ``--user`` to the ``install`` command to install in the current user's home directory.) Documentation ------------- The full documentation is available on https://mechanicalsoup.readthedocs.io/. You may want to jump directly to the `automatically generated API documentation `__. Example ------- From ``__, code to get the results from a Qwant search: .. code:: python """Example usage of MechanicalSoup to get the results from the Qwant search engine. """ import re import mechanicalsoup import html import urllib.parse # Connect to Qwant browser = mechanicalsoup.StatefulBrowser(user_agent='MechanicalSoup') browser.open("https://lite.qwant.com/") # Fill-in the search form browser.select_form('#search-form') browser["q"] = "MechanicalSoup" browser.submit_selected() # Display the results for link in browser.page.select('.result a'): # Qwant shows redirection links, not the actual URL, so extract # the actual URL from the redirect link: href = link.attrs['href'] m = re.match(r"^/redirect/[^/]*/(.*)$", href) if m: href = urllib.parse.unquote(m.group(1)) print(link.text, '->', href) More examples are available in ``__. For an example with a more complex form (checkboxes, radio buttons and textareas), read ``__ and ``__. Development ----------- |Build Status| |Coverage Status| |Documentation Status| |CII Best Practices| Instructions for building, testing and contributing to MechanicalSoup: see ``__. Common problems --------------- Read the `FAQ `__. .. |Latest Version| image:: https://img.shields.io/pypi/v/MechanicalSoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ .. |Supported Versions| image:: https://img.shields.io/pypi/pyversions/mechanicalsoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ .. 
|Build Status| image:: https://github.com/MechanicalSoup/MechanicalSoup/actions/workflows/python-package.yml/badge.svg?branch=main :target: https://github.com/MechanicalSoup/MechanicalSoup/actions/workflows/python-package.yml?query=branch%3Amain .. |Coverage Status| image:: https://codecov.io/gh/MechanicalSoup/MechanicalSoup/branch/main/graph/badge.svg :target: https://codecov.io/gh/MechanicalSoup/MechanicalSoup .. |Documentation Status| image:: https://readthedocs.org/projects/mechanicalsoup/badge/?version=latest :target: https://mechanicalsoup.readthedocs.io/en/latest/?badge=latest .. |CII Best Practices| image:: https://bestpractices.coreinfrastructure.org/projects/1334/badge :target: https://bestpractices.coreinfrastructure.org/projects/1334 .. |Gitter Chat| image:: https://badges.gitter.im/MechanicalSoup/MechanicalSoup.svg :target: https://gitter.im/MechanicalSoup/Lobby ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7279239 MechanicalSoup-1.3.0/docs/0000755000175100001720000000000014451070531014736 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/ChangeLog.rst0000644000175100001720000003317414451070517017333 0ustar00runnerdocker============= Release Notes ============= Version 1.3 =========== Breaking changes ---------------- * To prevent malicious web servers from reading arbitrary files from the client, files must now be opened explicitly by the user in order to upload their contents in form submission. For example, instead of: browser["upload"] = "/path/to/file" you would now use: browser["upload"] = open("/path/to/file", "rb") This remediates `CVE-2023-34457 `__. Our thanks to @e-c-d for reporting and helping to fix the vulnerability! Main changes ------------ * Added support for Python 3.11. * Allow submitting a form with no submit element. This can be achieved by passing ``submit=False`` to ``StatefulBrowser.submit_selected``. Thanks @alexreg! [`#480 `__] Bug fixes --------- * When uploading a file, only the filename is now submitted to the server. Previously, the full file path was being submitted, which exposed more local information than users may have been expecting. [`#375 `__] Version 1.1 =========== Main changes ------------ * Dropped support for EOL Python versions: 2.7 and 3.5. * Increased minimum version requirement for requests from 2.0 to 2.22.0 and beautifulsoup4 from 4.4 to 4.7. * Use encoding from the HTTP request when no HTML encoding is specified. [`#355 `__] * Added the ``put`` method to the ``Browser`` class. This is a light wrapper around ``requests.Session.put``. [`#359 `__] * Don't override ``Referer`` headers passed in by the user. [`#364 `__] * ``StatefulBrowser`` methods ``follow_link`` and ``download_link`` now support passing a dictionary of keyword arguments to ``requests``, via ``requests_kwargs``. For symmetry, they also support passing Beautiful Soup args in as ``bs4_kwargs``, although any excess ``**kwargs`` are sent to Beautiful Soup as well, just as they were previously. [`#368 `__] Version 1.0 =========== This is the last release that will support Python 2.7. Thanks to the many contributors that made this release possible! Main changes: ------------- * Added support for Python 3.8 and 3.9. * ``StatefulBrowser`` has new properties ``page``, ``form``, and ``url``, which can be used in place of the methods ``get_current_page``, ``get_current_form`` and ``get_url`` respectively (e.g. 
the new ``x.page`` is equivalent to ``x.get_current_page()``). These methods may be deprecated in a future release. [`#175 `__] * ``StatefulBrowser.form`` will raise an ``AttributeError`` instead of returning ``None`` if no form has been selected yet. Note that ``StatefulBrowser.get_current_form()`` still returns ``None`` for backward compatibility. Bug fixes --------- * Decompose ```` element. Bug fixes --------- * Checking checkboxes with ``browser["name"] = ("val1", "val2")`` now unchecks all checkbox except the ones explicitly specified. * ``StatefulBrowser.submit_selected`` and ``StatefulBrowser.open`` now reset __current_page to None when the result is not an HTML page. This fixes a bug where __current_page was still the previous page. * We don't error out anymore when trying to uncheck a box which doesn't have a ``checkbox`` attribute. * ``Form.new_control`` now correctly overrides existing elements. Internal changes ---------------- * The testsuite has been further improved and reached 100% coverage. * Tests are now run against the local version of MechanicalSoup, not against the installed version. * ``Browser.add_soup`` will now always attach a *soup*-attribute. If the response is not text/html, then soup is set to None. * ``Form.set(force=True)`` creates an ```` element instead of an ````. Version 0.8 =========== Main changes: ------------- * `Browser` and `StatefulBrowser` can now be configured to raise a `LinkNotFound` exception when encountering a 404 Not Found error. This is activated by passing `raise_on_404=True` to the constructor. It is disabled by default for backward compatibility, but is highly recommended. * `Browser` now has a `__del__` method that closes the current session when the object is deleted. * A `Link` object can now be passed to `follow_link`. * The user agent can now be customized. The default includes `MechanicalSoup` and its version. * There is now a direct interface to the cookiejar in `*Browser` classes (`(set|get)_cookiejar` methods). * This is the last MechanicalSoup version supporting Python 2.6 and 3.3. Bug fixes: ---------- * We used to crash on forms without action="..." fields. * The `choose_submit` method has been fixed, and the `btnName` argument of `StatefulBrowser.submit_selected` is now a shortcut for using `choose_submit`. * Arguments to `open_relative` were not properly forwarded. Internal changes: ----------------- * The testsuite has been greatly improved. It now uses the pytest API (not only the `pytest` launcher) for more concise code. * The coverage of the testsuite is now measured with codecov.io. The results can be viewed on: https://codecov.io/gh/hickford/MechanicalSoup * We now have a requires.io badge to help us tracking issues with dependencies. The report can be viewed on: https://requires.io/github/hickford/MechanicalSoup/requirements/ * The version number now appears in a single place in the source code. Version 0.7 =========== see Git history, no changelog sorry. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/Makefile0000644000175100001720000001553314451070517016411 0ustar00runnerdocker# Makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build PAPER = BUILDDIR = _build # User-friendly check for sphinx-build ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1) $(error The '$(SPHINXBUILD)' command was not found. 
Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/) endif # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . # the i18n builder cannot share the environment and doctrees with the others I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . .PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext help: @echo "Please use \`make ' where is one of" @echo " html to make standalone HTML files" @echo " dirhtml to make HTML files named index.html in directories" @echo " singlehtml to make a single large HTML file" @echo " pickle to make pickle files" @echo " json to make JSON files" @echo " htmlhelp to make HTML files and a HTML help project" @echo " qthelp to make HTML files and a qthelp project" @echo " devhelp to make HTML files and a Devhelp project" @echo " epub to make an epub" @echo " latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter" @echo " latexpdf to make LaTeX files and run them through pdflatex" @echo " latexpdfja to make LaTeX files and run them through platex/dvipdfmx" @echo " text to make text files" @echo " man to make manual pages" @echo " texinfo to make Texinfo files" @echo " info to make Texinfo files and run them through makeinfo" @echo " gettext to make PO message catalogs" @echo " changes to make an overview of all changed/added/deprecated items" @echo " xml to make Docutils-native XML files" @echo " pseudoxml to make pseudoxml-XML files for display purposes" @echo " linkcheck to check all external links for integrity" @echo " doctest to run all doctests embedded in the documentation (if enabled)" .PHONY: apidoc # Create a list of modules with the proper .. automodule:: directive. # Use --no-toc to avoid creating a list containing only one module. apidoc: sphinx-apidoc --no-toc -o . ../mechanicalsoup clean: rm -rf $(BUILDDIR)/* html: $(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/html." dirhtml: $(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml @echo @echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml." singlehtml: $(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml @echo @echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml." pickle: $(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle @echo @echo "Build finished; now you can process the pickle files." json: $(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json @echo @echo "Build finished; now you can process the JSON files." htmlhelp: $(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp @echo @echo "Build finished; now you can run HTML Help Workshop with the" \ ".hhp project file in $(BUILDDIR)/htmlhelp." 
qthelp: $(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp @echo @echo "Build finished; now you can run "qcollectiongenerator" with the" \ ".qhcp project file in $(BUILDDIR)/qthelp, like this:" @echo "# qcollectiongenerator $(BUILDDIR)/qthelp/MechanicalSoup.qhcp" @echo "To view the help file:" @echo "# assistant -collectionFile $(BUILDDIR)/qthelp/MechanicalSoup.qhc" devhelp: $(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp @echo @echo "Build finished." @echo "To view the help file:" @echo "# mkdir -p $$HOME/.local/share/devhelp/MechanicalSoup" @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/MechanicalSoup" @echo "# devhelp" epub: $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub @echo @echo "Build finished. The epub file is in $(BUILDDIR)/epub." latex: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." @echo "Run \`make' in that directory to run these through (pdf)latex" \ "(use \`make latexpdf' here to do that automatically)." latexpdf: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through pdflatex..." $(MAKE) -C $(BUILDDIR)/latex all-pdf @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." latexpdfja: $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex @echo "Running LaTeX files through platex and dvipdfmx..." $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." text: $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text @echo @echo "Build finished. The text files are in $(BUILDDIR)/text." man: $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man @echo @echo "Build finished. The manual pages are in $(BUILDDIR)/man." texinfo: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." @echo "Run \`make' in that directory to run these through makeinfo" \ "(use \`make info' here to do that automatically)." info: $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo @echo "Running Texinfo files through makeinfo..." make -C $(BUILDDIR)/texinfo info @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." gettext: $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale @echo @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." changes: $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes @echo @echo "The overview file is in $(BUILDDIR)/changes." linkcheck: $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck @echo @echo "Link check complete; look for any errors in the above output " \ "or in $(BUILDDIR)/linkcheck/output.txt." doctest: $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest @echo "Testing of doctests in the sources finished, look at the " \ "results in $(BUILDDIR)/doctest/output.txt." xml: $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml @echo @echo "Build finished. The XML files are in $(BUILDDIR)/xml." pseudoxml: $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml @echo @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 
././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/conf.py0000644000175100001720000002013214451070517016237 0ustar00runnerdocker#!/usr/bin/env python3 # # MechanicalSoup documentation build configuration file, created by # sphinx-quickstart on Sun Sep 14 18:44:39 2014. # # This file is execfile()d with the current directory set to its # containing dir. # # Note that not all possible configuration values are present in this # autogenerated file. # # All configuration values have a default; values that are commented out # serve to show the default. import os import sys from datetime import datetime # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. sys.path.insert(0, os.path.abspath('..')) import mechanicalsoup # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. #needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = ['sphinx.ext.autodoc'] # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] # The suffix of source filenames. source_suffix = '.rst' # The encoding of source files. #source_encoding = 'utf-8-sig' # The master toctree document. master_doc = 'index' # General information about the project. project = 'MechanicalSoup' copyright = '2014-{}'.format(datetime.utcnow().year) # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = mechanicalsoup.__version__ # The full version, including alpha/beta/rc tags. release = version # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. #language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: #today = '' # Else, today_fmt is used as the format for a strftime call. #today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = ['_build'] # The reST default role (used for this markup: `text`) to use for all # documents. #default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. #add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). #add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. #show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. #modindex_common_prefix = [] # If true, keep warnings as "system message" paragraphs in the built documents. #keep_warnings = False # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. 
html_theme = 'default' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. #html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. #html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". #html_title = None # A shorter title for the navigation bar. Default is the same as html_title. #html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. #html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. #html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". #html_static_path = ['_static'] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. #html_extra_path = [] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. #html_use_smartypants = True # Custom sidebar templates, maps document names to template names. #html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. #html_additional_pages = {} # If false, no module index is generated. #html_domain_indices = True # If false, no index is generated. #html_use_index = True # If true, the index is split into individual pages for each letter. #html_split_index = False # If true, links to the reST sources are added to the pages. #html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. #html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. #html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. #html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). #html_file_suffix = None # Output file base name for HTML help builder. htmlhelp_basename = 'MechanicalSoupdoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). #'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). #'pointsize': '10pt', # Additional stuff for the LaTeX preamble. #'preamble': '', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ ('index', 'MechanicalSoup.tex', 'MechanicalSoup Documentation', '', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. #latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. 
#latex_use_parts = False # If true, show page references after internal links. #latex_show_pagerefs = False # If true, show URL addresses after external links. #latex_show_urls = False # Documents to append as an appendix to all manuals. #latex_appendices = [] # If false, no module index is generated. #latex_domain_indices = True # -- Options for manual page output --------------------------------------- # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [ ('index', 'mechanicalsoup', 'MechanicalSoup Documentation', [''], 1) ] # If true, show URL addresses after external links. #man_show_urls = False # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ ('index', 'MechanicalSoup', 'MechanicalSoup Documentation', '', 'MechanicalSoup', 'One line description of project.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. #texinfo_appendices = [] # If false, no module index is generated. #texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. #texinfo_show_urls = 'footnote' # If true, do not generate a @detailmenu in the "Top" node's menu. #texinfo_no_detailmenu = False ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/external-resources.rst0000644000175100001720000000241314451070517021326 0ustar00runnerdockerExternal Resources ================== External libraries ------------------ * Requests (HTTP layer): http://docs.python-requests.org/en/master/ * BeautifulSoup (HTML parsing and manipulation): https://www.crummy.com/software/BeautifulSoup/bs4/doc/ MechanicalSoup on the web ------------------------- * `MechanicalSoup tag on stackoverflow `__ * `MechanicalSoup on Gitter `__ * News archive: * `opensource.com blog `__ * `Hacker News post `__ * `Reddit discussion `__ Projects using MechanicalSoup ----------------------------- These projects use MechanicalSoup for web scraping. You may want to look at their source code for real-life examples. * `Chamilo Tools `__ * `gmusicapi `__: an unofficial API for Google Play Music * `PatZilla `__: Patent information research for humans * *TODO: Add your favorite tool here ...* ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/faq.rst0000644000175100001720000002070714451070517016251 0ustar00runnerdockerFrequently Asked Questions ========================== When to use MechanicalSoup? ~~~~~~~~~~~~~~~~~~~~~~~~~~~ MechanicalSoup is designed to simulate the behavior of a human using a web browser. Possible use-case include: * Interacting with a website that doesn't provide a webservice API, out of a browser. * Testing a website you're developing There are also situations when you should *not* use MechanicalSoup, like: * If the website provides a webservice API (e.g. REST), then you should use this API and you don't need MechanicalSoup. * If the website you're interacting with does not contain HTML pages, then MechanicalSoup won't bring anything compared to `requests `__, so just use requests instead. * If the website relies on JavaScript, then you probably need a fully-fledged browser. `Selenium `__ may help you there, but it's a far heavier solution than MechanicalSoup. 
* If the website is specifically designed to interact with humans, please don't go against the will of the website's owner. How do I get debug information/logs ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To understand what's going on while running a script, you have two options: * Use :func:`~mechanicalsoup.StatefulBrowser.set_verbose` to set the debug level to 1 (show one dot for each page opened, a poor man's progress bar) or 2 (show the URL of each visited page). * Activate request's logging:: import requests import logging logging.getLogger().setLevel(logging.DEBUG) requests_log = logging.getLogger("requests.packages.urllib3") requests_log.setLevel(logging.DEBUG) requests_log.propagate = True This will display a much more verbose output, including HTTP status code for each page visited. Note that unlike MechanicalSoup's logging system, this includes URL returning a redirect (e.g. HTTP 301), that are dealt with automatically by requests and not visible to MechanicalSoup. Should I use Browser or StatefulBrowser? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Short answer: :class:`mechanicalsoup.StatefulBrowser`. :class:`mechanicalsoup.Browser` is historically the first class that was introduced in Mechanicalsoup. Using it is a bit verbose, as the caller needs to store the URL of the currently visited page and manipulate the current form with a separate variable. :class:`mechanicalsoup.StatefulBrowser` is essentially a superset of :class:`mechanicalsoup.Browser`, it's the one you should use unless you have a good reason to do otherwise. .. _label-alternatives: How does MechanicalSoup compare to the alternatives? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ There are other libraries with the same purpose as MechanicalSoup: * `Mechanize `__ is an ancestor of MechanicalSoup (getting its name from the Perl mechanize module). It was a great tool, but became unmaintained for several years and didn't support Python 3. Fortunately, Mechanize got a new maintainer in 2017 and completed Python 3 support in 2019. Note that Mechanize is a much bigger piece of code (around 20 times more lines!) than MechanicalSoup, which is small because it delegates most of its work to BeautifulSoup and requests. * `RoboBrowser `__ is very similar to MechanicalSoup. Both are small libraries built on top of requests and BeautifulSoup. Their APIs are very similar. Both have an automated testsuite. As of writing, MechanicalSoup is more actively maintained (only 1 really active developer and no activity since 2015 on RoboBrowser). RoboBrowser is `broken on Python 3.7 `__, and while there is an easy workaround this is a sign that the lack of activity is due to the project being abandoned more than to its maturity. * `Selenium `__ is a much heavier solution: it launches a real web browser (Firefox, Chrome, ...) and controls it with inter-process communication. Selenium is the right solution if you want to test that a website works properly with various browsers (e.g. is the JavaScript code you're writing compatible with all major browsers on the market?), and is generally useful when you need JavaScript support. Though MechanicalSoup does not support JavaScript, it also does not have the overhead of a real web browser, which makes it a simple and efficient solution for basic website interactions. 
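To give a concrete sense of what such a "basic website interaction" looks like without driving a real browser, here is a minimal sketch. It reuses the httpbin.org form from the tutorial; the URL and the ``custname`` field name are only illustrative of that particular page::

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open("http://httpbin.org/forms/post")

    # Select the only form on the page and fill in one of its text fields.
    browser.select_form('form[action="/post"]')
    browser["custname"] = "Me"

    # Submit the form; the return value is a requests.Response object.
    response = browser.submit_selected()
    print(response.status_code)

    browser.close()

A handful of lines like these cover the whole open/fill/submit cycle, which is the kind of task where the overhead of Selenium is usually not justified.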
Form submission has no effect or fails ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If you believe you are using MechanicalSoup correctly, but form submission still does not behave the way you expect, the likely explanation is that the page uses JavaScript to dynamically generate response content when you submit the form in a real browser. A common symptom is when form elements are missing required attributes (e.g. if `form` is missing the `action` attribute or an `input` is missing the `name` attribute). In such cases, you typically have two options: 1. If you know what content the server expects to receive from form submission, then you can use MechanicalSoup to manually add that content using, i.e., :func:`~mechanicalsoup.Form.new_control`. This is unlikely to be a reliable solution unless you are testing a website that you own. 2. Use a tool that supports JavaScript, like `Selenium `__. See :ref:`label-alternatives` for more information. My form doesn't have a unique submit name. What can I do? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This answer will help those encountering a "Multiple submit elements match" error when trying to submit a form. Since MechanicalSoup uses `BeautifulSoup `__ under the hood, you can uniquely select any element on the page using its many convenient search functions, e.g. `.find() `__ and `.select() `__. Then you can pass that element to :func:`~mechanicalsoup.Form.choose_submit` or :func:`~mechanicalsoup.StatefulBrowser.submit_selected`, assuming it is a valid submit element. For example, if you have a form with a submit element only identified by a unique ``id="button3"`` attribute, you can do the following:: br = mechanicalsoup.StatefulBrowser() br.open(...) submit = br.page.find('input', id='button3') form = br.select_form() form.choose_submit(submit) br.submit_selected() "No parser was explicitly specified" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. Some versions of BeautifulSoup show a harmless warning to encourage you to specify which HTML parser to use. In MechanicalSoup 0.9, the default parser is set by MechanicalSoup, so you shouldn't get the error anymore (or you should upgrade) unless you specified a non-standard `soup_config` argument to the browser's constructor. If you specify a `soup_config` argument, you should include the parser to use, like:: mechanicalsoup.StatefulBrowser(soup_config={'features': 'lxml', '...': '...'}) Or if you don't have the parser `lxml `__ installed:: mechanicalsoup.StatefulBrowser(soup_config={'features': 'parser.html', ...}) See also https://www.crummy.com/software/BeautifulSoup/bs4/doc/#you-need-a-parser "ReferenceError: weakly-referenced object no longer exists" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This error can occur within requests' ``session.py`` when called by the destructor (``__del__``) of browser. The solution is to call :func:`~mechanicalsoup.Browser.close` before the end of life of the object. Alternatively, you may also use the ``with`` statement which closes the browser for you:: def test_with(): with mechanicalsoup.StatefulBrowser() as browser: browser.open(url) # ... # implicit call to browser.close() here. 
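For completeness, the same effect can be obtained by calling ``close`` explicitly once you are done with the browser. This is a minimal sketch of that variant, mirroring the example above (``url`` is a placeholder)::

    def test_close():
        browser = mechanicalsoup.StatefulBrowser()
        browser.open(url)
        # ...
        browser.close()  # explicit close, instead of relying on __del__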
This problem is fixed in MechanicalSoup 0.10, so this is only required for compatibility with older versions. Code using new versions can let the ``browser`` variable go out of scope and let the garbage collector close it properly. ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/index.rst0000644000175100001720000000321714451070517016606 0ustar00runnerdocker.. MechanicalSoup documentation master file, created by sphinx-quickstart on Sun Sep 14 18:44:39 2014. You can adapt this file completely to your liking, but it should at least contain the root `toctree` directive. .. image:: ../assets/mechanical-soup-logo.png :alt: MechanicalSoup. A Python library for automating website interaction. :align: center .. This '|' generates a blank line to avoid sticking the logo to the section. | Welcome to MechanicalSoup's documentation! ========================================== A Python library for automating interaction with websites. MechanicalSoup automatically stores and sends cookies, follows redirects, and can follow links and submit forms. It doesn't do Javascript. MechanicalSoup was created by `M Hickford `__, who was a fond user of the `Mechanize `__ library. Unfortunately, Mechanize is `incompatible with Python 3 `__ and its development stalled for several years. MechanicalSoup provides a similar API, built on Python giants `Requests `__ (for http sessions) and `BeautifulSoup `__ (for document navigation). Since 2017 it is a project actively maintained by a small team including `@hemberger `__ and `@moy `__. Contents: .. toctree:: :maxdepth: 2 introduction tutorial mechanicalsoup faq external-resources ChangeLog Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search` ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/introduction.rst0000644000175100001720000000237214451070517020221 0ustar00runnerdockerIntroduction ============ |Latest Version| |Supported Versions| PyPy3 is also supported (and tested against). Find MechanicalSoup on `Python Package Index (Pypi) `__ and follow the development on `GitHub `__. Installation ------------ Download and install the latest released version from `PyPI `__:: pip install MechanicalSoup Download and install the development version from GitHub:: pip install git+https://github.com/MechanicalSoup/MechanicalSoup Installing from source (installs the version in the current working directory):: git clone https://github.com/MechanicalSoup/MechanicalSoup.git cd MechanicalSoup python setup.py install (In all cases, add ``--user`` to the ``install`` command to install in the current user's home directory.) Example code: https://github.com/MechanicalSoup/MechanicalSoup/tree/main/examples/ .. |Latest Version| image:: https://img.shields.io/pypi/v/MechanicalSoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ .. |Supported Versions| image:: https://img.shields.io/pypi/pyversions/mechanicalsoup.svg :target: https://pypi.python.org/pypi/MechanicalSoup/ ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/make.bat0000644000175100001720000001451314451070517016353 0ustar00runnerdocker@ECHO OFF REM Command file for Sphinx documentation if "%SPHINXBUILD%" == "" ( set SPHINXBUILD=sphinx-build ) set BUILDDIR=_build set ALLSPHINXOPTS=-d %BUILDDIR%/doctrees %SPHINXOPTS% . set I18NSPHINXOPTS=%SPHINXOPTS% . 
if NOT "%PAPER%" == "" ( set ALLSPHINXOPTS=-D latex_paper_size=%PAPER% %ALLSPHINXOPTS% set I18NSPHINXOPTS=-D latex_paper_size=%PAPER% %I18NSPHINXOPTS% ) if "%1" == "" goto help if "%1" == "help" ( :help echo.Please use `make ^` where ^ is one of echo. html to make standalone HTML files echo. dirhtml to make HTML files named index.html in directories echo. singlehtml to make a single large HTML file echo. pickle to make pickle files echo. json to make JSON files echo. htmlhelp to make HTML files and a HTML help project echo. qthelp to make HTML files and a qthelp project echo. devhelp to make HTML files and a Devhelp project echo. epub to make an epub echo. latex to make LaTeX files, you can set PAPER=a4 or PAPER=letter echo. text to make text files echo. man to make manual pages echo. texinfo to make Texinfo files echo. gettext to make PO message catalogs echo. changes to make an overview over all changed/added/deprecated items echo. xml to make Docutils-native XML files echo. pseudoxml to make pseudoxml-XML files for display purposes echo. linkcheck to check all external links for integrity echo. doctest to run all doctests embedded in the documentation if enabled goto end ) if "%1" == "clean" ( for /d %%i in (%BUILDDIR%\*) do rmdir /q /s %%i del /q /s %BUILDDIR%\* goto end ) %SPHINXBUILD% 2> nul if errorlevel 9009 ( echo. echo.The 'sphinx-build' command was not found. Make sure you have Sphinx echo.installed, then set the SPHINXBUILD environment variable to point echo.to the full path of the 'sphinx-build' executable. Alternatively you echo.may add the Sphinx directory to PATH. echo. echo.If you don't have Sphinx installed, grab it from echo.http://sphinx-doc.org/ exit /b 1 ) if "%1" == "html" ( %SPHINXBUILD% -b html %ALLSPHINXOPTS% %BUILDDIR%/html if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/html. goto end ) if "%1" == "dirhtml" ( %SPHINXBUILD% -b dirhtml %ALLSPHINXOPTS% %BUILDDIR%/dirhtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/dirhtml. goto end ) if "%1" == "singlehtml" ( %SPHINXBUILD% -b singlehtml %ALLSPHINXOPTS% %BUILDDIR%/singlehtml if errorlevel 1 exit /b 1 echo. echo.Build finished. The HTML pages are in %BUILDDIR%/singlehtml. goto end ) if "%1" == "pickle" ( %SPHINXBUILD% -b pickle %ALLSPHINXOPTS% %BUILDDIR%/pickle if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the pickle files. goto end ) if "%1" == "json" ( %SPHINXBUILD% -b json %ALLSPHINXOPTS% %BUILDDIR%/json if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can process the JSON files. goto end ) if "%1" == "htmlhelp" ( %SPHINXBUILD% -b htmlhelp %ALLSPHINXOPTS% %BUILDDIR%/htmlhelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run HTML Help Workshop with the ^ .hhp project file in %BUILDDIR%/htmlhelp. goto end ) if "%1" == "qthelp" ( %SPHINXBUILD% -b qthelp %ALLSPHINXOPTS% %BUILDDIR%/qthelp if errorlevel 1 exit /b 1 echo. echo.Build finished; now you can run "qcollectiongenerator" with the ^ .qhcp project file in %BUILDDIR%/qthelp, like this: echo.^> qcollectiongenerator %BUILDDIR%\qthelp\MechanicalSoup.qhcp echo.To view the help file: echo.^> assistant -collectionFile %BUILDDIR%\qthelp\MechanicalSoup.ghc goto end ) if "%1" == "devhelp" ( %SPHINXBUILD% -b devhelp %ALLSPHINXOPTS% %BUILDDIR%/devhelp if errorlevel 1 exit /b 1 echo. echo.Build finished. goto end ) if "%1" == "epub" ( %SPHINXBUILD% -b epub %ALLSPHINXOPTS% %BUILDDIR%/epub if errorlevel 1 exit /b 1 echo. echo.Build finished. 
The epub file is in %BUILDDIR%/epub. goto end ) if "%1" == "latex" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex if errorlevel 1 exit /b 1 echo. echo.Build finished; the LaTeX files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdf" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf cd %BUILDDIR%/.. echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "latexpdfja" ( %SPHINXBUILD% -b latex %ALLSPHINXOPTS% %BUILDDIR%/latex cd %BUILDDIR%/latex make all-pdf-ja cd %BUILDDIR%/.. echo. echo.Build finished; the PDF files are in %BUILDDIR%/latex. goto end ) if "%1" == "text" ( %SPHINXBUILD% -b text %ALLSPHINXOPTS% %BUILDDIR%/text if errorlevel 1 exit /b 1 echo. echo.Build finished. The text files are in %BUILDDIR%/text. goto end ) if "%1" == "man" ( %SPHINXBUILD% -b man %ALLSPHINXOPTS% %BUILDDIR%/man if errorlevel 1 exit /b 1 echo. echo.Build finished. The manual pages are in %BUILDDIR%/man. goto end ) if "%1" == "texinfo" ( %SPHINXBUILD% -b texinfo %ALLSPHINXOPTS% %BUILDDIR%/texinfo if errorlevel 1 exit /b 1 echo. echo.Build finished. The Texinfo files are in %BUILDDIR%/texinfo. goto end ) if "%1" == "gettext" ( %SPHINXBUILD% -b gettext %I18NSPHINXOPTS% %BUILDDIR%/locale if errorlevel 1 exit /b 1 echo. echo.Build finished. The message catalogs are in %BUILDDIR%/locale. goto end ) if "%1" == "changes" ( %SPHINXBUILD% -b changes %ALLSPHINXOPTS% %BUILDDIR%/changes if errorlevel 1 exit /b 1 echo. echo.The overview file is in %BUILDDIR%/changes. goto end ) if "%1" == "linkcheck" ( %SPHINXBUILD% -b linkcheck %ALLSPHINXOPTS% %BUILDDIR%/linkcheck if errorlevel 1 exit /b 1 echo. echo.Link check complete; look for any errors in the above output ^ or in %BUILDDIR%/linkcheck/output.txt. goto end ) if "%1" == "doctest" ( %SPHINXBUILD% -b doctest %ALLSPHINXOPTS% %BUILDDIR%/doctest if errorlevel 1 exit /b 1 echo. echo.Testing of doctests in the sources finished, look at the ^ results in %BUILDDIR%/doctest/output.txt. goto end ) if "%1" == "xml" ( %SPHINXBUILD% -b xml %ALLSPHINXOPTS% %BUILDDIR%/xml if errorlevel 1 exit /b 1 echo. echo.Build finished. The XML files are in %BUILDDIR%/xml. goto end ) if "%1" == "pseudoxml" ( %SPHINXBUILD% -b pseudoxml %ALLSPHINXOPTS% %BUILDDIR%/pseudoxml if errorlevel 1 exit /b 1 echo. echo.Build finished. The pseudo-XML files are in %BUILDDIR%/pseudoxml. goto end ) :end ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/mechanicalsoup.rst0000644000175100001720000000112214451070517020463 0ustar00runnerdockerThe mechanicalsoup package: API documentation ============================================= .. module:: mechanicalsoup StatefulBrowser --------------- .. autoclass:: StatefulBrowser :members: :undoc-members: :show-inheritance: :special-members: __setitem__ Browser ------- .. autoclass:: Browser :members: :undoc-members: Form ---- .. autoclass:: Form :members: :undoc-members: :special-members: __setitem__ Exceptions ---------- .. autoexception:: LinkNotFoundError :show-inheritance: .. 
autoexception:: InvalidFormMethod :show-inheritance: ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/docs/tutorial.rst0000644000175100001720000002050714451070517017343 0ustar00runnerdockerMechanicalSoup tutorial ======================= First contact, step by step --------------------------- As a simple example, we'll browse http://httpbin.org/, a website designed to test tools like MechanicalSoup. First, let's create a browser object:: >>> import mechanicalsoup >>> browser = mechanicalsoup.StatefulBrowser() To customize the way the browser is built (change the user-agent, the HTML parser to use, the way to react to 404 Not Found errors, ...), see :func:`~mechanicalsoup.StatefulBrowser.__init__`. Now, open the webpage we want:: >>> browser.open("http://httpbin.org/") <Response [200]> The return value of :func:`~mechanicalsoup.StatefulBrowser.open` is an object of type requests.Response_. MechanicalSoup uses the requests_ library to perform the actual requests to the website, so it's no surprise that we get such an object. In short, it contains the data and meta-data that the server sent us. You see the HTTP response status, 200, which means "OK", but the object also contains the content of the page we just downloaded. Just like a normal browser's URL bar, the browser remembers which URL it's browsing:: >>> browser.url 'http://httpbin.org/' Now, let's follow the link to ``/forms/post``:: >>> browser.follow_link("forms") >>> browser.url 'http://httpbin.org/forms/post' We passed a regular expression ``"forms"`` to :func:`~mechanicalsoup.StatefulBrowser.follow_link`, which followed the link whose text matched this expression. There are many other ways to call :func:`~mechanicalsoup.StatefulBrowser.follow_link`, but we'll come back to them later. We're now visiting http://httpbin.org/forms/post, which contains a form. Let's see the page content:: >>> browser.page ...
... Actually, the return type of :func:`~mechanicalsoup.StatefulBrowser.page` is bs4.BeautifulSoup_. BeautifulSoup, aka bs4, is the second library used by MechanicalSoup: it is an HTML manipulation library. You can now navigate the tags of the page using BeautifulSoup. For example, to get all the ``<legend>`` tags:: >>> browser.page.find_all('legend') [<legend> Pizza Size </legend>, <legend> Pizza Toppings </legend>] To fill-in a form, we need to tell MechanicalSoup which form we're going to fill-in and submit:: >>> browser.select_form('form[action="/post"]') The argument to :func:`~mechanicalsoup.StatefulBrowser.select_form` is a CSS selector. Here, we select an HTML tag named ``form`` having an attribute ``action`` whose value is ``"/post"``. Since there's only one form in the page, ``browser.select_form()`` would have done the trick too. Now, let's give a value to the fields in the form. First, what are the available fields? You can print a summary of the currently selected form with :func:`~mechanicalsoup.Form.print_summary()`:: >>> browser.form.print_summary() For text fields, it's simple: just give a value to the ``input`` elements, identified by their ``name`` attribute:: >>> browser["custname"] = "Me" >>> browser["custtel"] = "00 00 0001" >>> browser["custemail"] = "nobody@example.com" >>> browser["comments"] = "This pizza looks really good :-)" For radio buttons, it's simple too: a group of radio buttons consists of several ``input`` tags with the same ``name`` and different values; just select the one you need (``"size"`` is the ``name`` attribute, ``"medium"`` is the ``value`` attribute of the element we want to tick):: >>> browser["size"] = "medium" For checkboxes, one can use the same mechanism to check a single box:: >>> browser["topping"] = "bacon" But we can also check any number of boxes by assigning a list or tuple to the field:: >>> browser["topping"] = ("bacon", "cheese") Actually, ``browser["..."] = "..."`` (i.e. calls to :func:`~mechanicalsoup.StatefulBrowser.__setitem__`) is just a helper to fill-in a form, but you can use any tool BeautifulSoup provides to modify the soup object, and MechanicalSoup will take care of submitting the form for you. Let's see what the filled-in form looks like:: >>> browser.launch_browser() :func:`~mechanicalsoup.StatefulBrowser.launch_browser` will launch a real web browser on the current page visited by our ``browser`` object, including the changes we just made to the form (note that it does not open the real webpage, but creates a temporary file containing the page content, and points your browser to this file). Try changing the boxes ticked and the content of the text field, and re-launch the browser. This method is very useful in combination with your browser's web development tools. For example, with Firefox, right-clicking a field and choosing "Inspect Element" will give you everything you need to manipulate this field (in particular the ``name`` and ``value`` attributes). It's also possible to check the content with :func:`~mechanicalsoup.Form.print_summary()` (which we already used to list the fields):: >>> browser.form.print_summary() Assuming we're satisfied with the content of the form, we can submit it (i.e.
simulate a click on the submit button):: >>> response = browser.submit_selected() The response is not an HTML page, so the browser doesn't parse it to a BeautifulSoup object, but we can still see the text it contains:: >>> print(response.text) { "args": {}, "data": "", "files": {}, "form": { "comments": "This pizza looks really good :-)", "custemail": "nobody@example.com", "custname": "Me", "custtel": "00 00 0001", "delivery": "", "size": "medium", "topping": [ "bacon", "cheese" ] }, ... To sum up, here is the complete example (`examples/expl_httpbin.py `__): .. literalinclude:: ../examples/expl_httpbin.py .. _requests: http://docs.python-requests.org/en/master/ .. _requests.Response: http://docs.python-requests.org/en/master/api/#requests.Response .. _bs4.BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup A more complete example: logging-in into GitHub ----------------------------------------------- The simplest way to use MechanicalSoup is to use the :class:`~mechanicalsoup.StatefulBrowser` class (this example is available as `examples/example.py `__ in MechanicalSoup's source code): .. literalinclude:: ../examples/example.py :language: python Alternatively, one can use the :class:`~mechanicalsoup.Browser` class, which doesn't maintain a state from one call to another (i.e. the Browser itself doesn't remember which page you are visiting and what its content is, it's up to the caller to do so). This example is available as `examples/example_manual.py `__ in the source: .. literalinclude:: ../examples/example_manual.py :language: python More examples ~~~~~~~~~~~~~ For more examples, see the `examples `__ directory in MechanicalSoup's source code. ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7279239 MechanicalSoup-1.3.0/examples/0000755000175100001720000000000014451070531015624 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/examples/example.py0000644000175100001720000000243214451070517017636 0ustar00runnerdocker"""Example app to login to GitHub using the StatefulBrowser class. 
NOTE: This example will not work if the user has 2FA enabled.""" import argparse from getpass import getpass import mechanicalsoup parser = argparse.ArgumentParser(description="Login to GitHub.") parser.add_argument("username") args = parser.parse_args() args.password = getpass("Please enter your GitHub password: ") browser = mechanicalsoup.StatefulBrowser( soup_config={'features': 'lxml'}, raise_on_404=True, user_agent='MyBot/0.1: mysite.example.com/bot_info', ) # Uncomment for a more verbose output: # browser.set_verbose(2) browser.open("https://github.com") browser.follow_link("login") browser.select_form('#login form') browser["login"] = args.username browser["password"] = args.password resp = browser.submit_selected() # Uncomment to launch a web browser on the current page: # browser.launch_browser() # verify we are now logged in page = browser.page messages = page.find("div", class_="flash-messages") if messages: print(messages.text) assert page.select(".logout-form") print(page.title.text) # verify we remain logged in (thanks to cookies) as we browse the rest of # the site page3 = browser.open("https://github.com/MechanicalSoup/MechanicalSoup") assert page3.soup.select(".logout-form") ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/examples/example_manual.py0000644000175100001720000000274014451070517021175 0ustar00runnerdocker"""Example app to login to GitHub, using the plain Browser class. See example.py for an example using the more advanced StatefulBrowser.""" import argparse import mechanicalsoup parser = argparse.ArgumentParser(description="Login to GitHub.") parser.add_argument("username") parser.add_argument("password") args = parser.parse_args() browser = mechanicalsoup.Browser(soup_config={'features': 'lxml'}) # request github login page. the result is a requests.Response object # http://docs.python-requests.org/en/latest/user/quickstart/#response-content login_page = browser.get("https://github.com/login") # similar to assert login_page.ok but with full status code in case of # failure. 
login_page.raise_for_status() # login_page.soup is a BeautifulSoup object # http://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautifulsoup # we grab the login form login_form = mechanicalsoup.Form(login_page.soup.select_one('#login form')) # specify username and password login_form.input({"login": args.username, "password": args.password}) # submit form page2 = browser.submit(login_form, login_page.url) # verify we are now logged in messages = page2.soup.find("div", class_="flash-messages") if messages: print(messages.text) assert page2.soup.select(".logout-form") print(page2.soup.title.text) # verify we remain logged in (thanks to cookies) as we browse the rest of # the site page3 = browser.get("https://github.com/MechanicalSoup/MechanicalSoup") assert page3.soup.select(".logout-form") ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7319238 MechanicalSoup-1.3.0/mechanicalsoup/0000755000175100001720000000000014451070531017001 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/mechanicalsoup/__init__.py0000644000175100001720000000046614451070517021124 0ustar00runnerdockerfrom .__version__ import __version__ from .browser import Browser from .form import Form, InvalidFormMethod from .stateful_browser import StatefulBrowser from .utils import LinkNotFoundError __all__ = ['StatefulBrowser', 'LinkNotFoundError', 'Browser', 'Form', 'InvalidFormMethod', '__version__'] ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/mechanicalsoup/__version__.py0000644000175100001720000000057114451070517021643 0ustar00runnerdocker__title__ = 'MechanicalSoup' __description__ = 'A Python library for automating interaction with websites' __url__ = 'https://mechanicalsoup.readthedocs.io/' __github_url__ = 'https://github.com/MechanicalSoup/MechanicalSoup' __version__ = '1.3.0' __license__ = 'MIT' __github_assets_absoluteURL__ = """\ https://raw.githubusercontent.com/MechanicalSoup/MechanicalSoup/main""" ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/mechanicalsoup/browser.py0000644000175100001720000003521314451070517021046 0ustar00runnerdockerimport io import os import tempfile import urllib import weakref import webbrowser import bs4 import bs4.dammit import requests from .__version__ import __title__, __version__ from .form import Form from .utils import LinkNotFoundError, is_multipart_file_upload class Browser: """Builds a low-level Browser. It is recommended to use :class:`StatefulBrowser` for most applications, since it offers more advanced features and conveniences than Browser. :param session: Attach a pre-existing requests Session instead of constructing a new one. :param soup_config: Configuration passed to BeautifulSoup to affect the way HTML is parsed. Defaults to ``{'features': 'lxml'}``. If overridden, it is highly recommended to `specify a parser `__. Otherwise, BeautifulSoup will issue a warning and pick one for you, but the parser it chooses may be different on different machines. :param requests_adapters: Configuration passed to requests, to affect the way HTTP requests are performed. :param raise_on_404: If True, raise :class:`LinkNotFoundError` when visiting a page triggers a 404 Not Found error. :param user_agent: Set the user agent header to this value. 
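
    A minimal usage sketch (the httpbin URL and the ``custname`` field are
    purely illustrative)::

        import mechanicalsoup

        browser = mechanicalsoup.Browser()
        page = browser.get("http://httpbin.org/forms/post")
        form = page.soup.select_one("form")
        # Fill a text input by setting its value attribute directly.
        form.select_one('input[name="custname"]')["value"] = "Me"
        response = browser.submit(form, page.url)
        print(response.status_code)
        browser.close()
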
""" def __init__(self, session=None, soup_config={'features': 'lxml'}, requests_adapters=None, raise_on_404=False, user_agent=None): self.raise_on_404 = raise_on_404 self.session = session or requests.Session() if hasattr(weakref, 'finalize'): self._finalize = weakref.finalize(self.session, self.close) else: # pragma: no cover # Python < 3 does not have weakref.finalize, but these # versions accept calling session.close() within __del__ self._finalize = self.close self.set_user_agent(user_agent) if requests_adapters is not None: for adaptee, adapter in requests_adapters.items(): self.session.mount(adaptee, adapter) self.soup_config = soup_config or dict() @staticmethod def __looks_like_html(response): """Guesses entity type when Content-Type header is missing. Since Content-Type is not strictly required, some servers leave it out. """ text = response.text.lstrip().lower() return text.startswith(' The HTTP header has a higher precedence than the in-document # > meta declarations. encoding = http_encoding if http_encoding else html_encoding response.soup = bs4.BeautifulSoup( response.content, from_encoding=encoding, **soup_config ) else: response.soup = None def set_cookiejar(self, cookiejar): """Replaces the current cookiejar in the requests session. Since the session handles cookies automatically without calling this function, only use this when default cookie handling is insufficient. :param cookiejar: Any `http.cookiejar.CookieJar `__ compatible object. """ self.session.cookies = cookiejar def get_cookiejar(self): """Gets the cookiejar from the requests session.""" return self.session.cookies def set_user_agent(self, user_agent): """Replaces the current user agent in the requests session headers.""" # set a default user_agent if not specified if user_agent is None: requests_ua = requests.utils.default_user_agent() user_agent = f'{requests_ua} ({__title__}/{__version__})' # the requests module uses a case-insensitive dict for session headers self.session.headers['User-agent'] = user_agent def request(self, *args, **kwargs): """Straightforward wrapper around `requests.Session.request `__. :return: `requests.Response `__ object with a *soup*-attribute added by :func:`add_soup`. This is a low-level function that should not be called for basic usage (use :func:`get` or :func:`post` instead). Use it if you need an HTTP verb that MechanicalSoup doesn't manage (e.g. MKCOL) for example. """ response = self.session.request(*args, **kwargs) Browser.add_soup(response, self.soup_config) return response def get(self, *args, **kwargs): """Straightforward wrapper around `requests.Session.get `__. :return: `requests.Response `__ object with a *soup*-attribute added by :func:`add_soup`. """ response = self.session.get(*args, **kwargs) if self.raise_on_404 and response.status_code == 404: raise LinkNotFoundError() Browser.add_soup(response, self.soup_config) return response def post(self, *args, **kwargs): """Straightforward wrapper around `requests.Session.post `__. :return: `requests.Response `__ object with a *soup*-attribute added by :func:`add_soup`. """ response = self.session.post(*args, **kwargs) Browser.add_soup(response, self.soup_config) return response def put(self, *args, **kwargs): """Straightforward wrapper around `requests.Session.put `__. :return: `requests.Response `__ object with a *soup*-attribute added by :func:`add_soup`. 
""" response = self.session.put(*args, **kwargs) Browser.add_soup(response, self.soup_config) return response @staticmethod def _get_request_kwargs(method, url, **kwargs): """This method exists to raise a TypeError when a method or url is specified in the kwargs. """ request_kwargs = {"method": method, "url": url} request_kwargs.update(kwargs) return request_kwargs @classmethod def get_request_kwargs(cls, form, url=None, **kwargs): """Extract input data from the form.""" method = str(form.get("method", "get")) action = form.get("action") url = urllib.parse.urljoin(url, action) if url is None: # This happens when both `action` and `url` are None. raise ValueError('no URL to submit to') # read https://www.w3.org/TR/html52/sec-forms.html if method.lower() == "get": data = kwargs.pop("params", dict()) else: data = kwargs.pop("data", dict()) files = kwargs.pop("files", dict()) # Use a list of 2-tuples to better reflect the behavior of browser QSL. # Requests also retains order when encoding form data in 2-tuple lists. data = [(k, v) for k, v in data.items()] multipart = form.get("enctype", "") == "multipart/form-data" # Process form tags in the order that they appear on the page, # skipping those tags that do not have a name-attribute. selector = ",".join(f"{tag}[name]" for tag in ("input", "button", "textarea", "select")) for tag in form.select(selector): name = tag.get("name") # name-attribute of tag # Skip disabled elements, since they should not be submitted. if tag.has_attr('disabled'): continue if tag.name == "input": if tag.get("type", "").lower() in ("radio", "checkbox"): if "checked" not in tag.attrs: continue value = tag.get("value", "on") else: # browsers use empty string for inputs with missing values value = tag.get("value", "") # If the enctype is not multipart, the filename is put in # the form as a text input and the file is not sent. if is_multipart_file_upload(form, tag): if isinstance(value, io.IOBase): content = value filename = os.path.basename(getattr(value, "name", "")) else: content = "" filename = os.path.basename(value) # If content is the empty string, we still pass it # for consistency with browsers (see # https://github.com/MechanicalSoup/MechanicalSoup/issues/250). files[name] = (filename, content) else: if isinstance(value, io.IOBase): value = os.path.basename(getattr(value, "name", "")) data.append((name, value)) elif tag.name == "button": if tag.get("type", "").lower() in ("button", "reset"): continue else: data.append((name, tag.get("value", ""))) elif tag.name == "textarea": data.append((name, tag.text)) elif tag.name == "select": # If the value attribute is not specified, the content will # be passed as a value instead. options = tag.select("option") selected_values = [i.get("value", i.text) for i in options if "selected" in i.attrs] if "multiple" in tag.attrs: for value in selected_values: data.append((name, value)) elif selected_values: # A standard select element only allows one option to be # selected, but browsers pick last if somehow multiple. data.append((name, selected_values[-1])) elif options: # Selects the first option if none are selected first_value = options[0].get("value", options[0].text) data.append((name, first_value)) if method.lower() == "get": kwargs["params"] = data else: kwargs["data"] = data # The following part of the function is here to respect the # enctype specified by the form, i.e. force sending multipart # content. 
Since Requests doesn't have yet a feature to choose # enctype, we have to use tricks to make it behave as we want # This code will be updated if Requests implements it. if multipart and not files: # Requests will switch to "multipart/form-data" only if # files pass the `if files:` test, so in this case we use # a modified dict that passes the if test even if empty. class DictThatReturnsTrue(dict): def __bool__(self): return True __nonzero__ = __bool__ files = DictThatReturnsTrue() return cls._get_request_kwargs(method, url, files=files, **kwargs) def _request(self, form, url=None, **kwargs): """Extract input data from the form to pass to a Requests session.""" request_kwargs = Browser.get_request_kwargs(form, url, **kwargs) return self.session.request(**request_kwargs) def submit(self, form, url=None, **kwargs): """Prepares and sends a form request. NOTE: To submit a form with a :class:`StatefulBrowser` instance, it is recommended to use :func:`StatefulBrowser.submit_selected` instead of this method so that the browser state is correctly updated. :param form: The filled-out form. :param url: URL of the page the form is on. If the form action is a relative path, then this must be specified. :param \\*\\*kwargs: Arguments forwarded to `requests.Session.request `__. If `files`, `params` (with GET), or `data` (with POST) are specified, they will be appended to by the contents of `form`. :return: `requests.Response `__ object with a *soup*-attribute added by :func:`add_soup`. """ if isinstance(form, Form): form = form.form response = self._request(form, url, **kwargs) Browser.add_soup(response, self.soup_config) return response def launch_browser(self, soup): """Launch a browser to display a page, for debugging purposes. :param: soup: Page contents to display, supplied as a bs4 soup object. """ with tempfile.NamedTemporaryFile(delete=False, suffix='.html') as file: file.write(soup.encode()) webbrowser.open('file://' + file.name) def close(self): """Close the current session, if still open.""" if self.session is not None: self.session.cookies.clear() self.session.close() self.session = None def __del__(self): self._finalize() def __enter__(self): return self def __exit__(self, *args): self.close() ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/mechanicalsoup/form.py0000644000175100001720000004012614451070517020325 0ustar00runnerdockerimport copy import io import warnings from bs4 import BeautifulSoup from .utils import LinkNotFoundError, is_multipart_file_upload class InvalidFormMethod(LinkNotFoundError): """This exception is raised when a method of :class:`Form` is used for an HTML element that is of the wrong type (or is malformed). It is caught within :func:`Form.set` to perform element type deduction. It is derived from :class:`LinkNotFoundError` so that a single base class can be used to catch all exceptions specific to this module. """ pass class Form: """Build a fillable form. :param form: A bs4.element.Tag corresponding to an HTML form element. The Form class is responsible for preparing HTML forms for submission. It handles the following types of elements: input (text, checkbox, radio), select, and textarea. Each type is set by a method named after the type (e.g. :func:`~Form.set_select`), and then there are convenience methods (e.g. :func:`~Form.set`) that do type-deduction and set the value using the appropriate method. It also handles submit-type elements using :func:`~Form.choose_submit`. 
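
    A small sketch of typical use through a StatefulBrowser (the selector
    and field names are illustrative, taken from the
    http://httpbin.org/forms/post demo form)::

        browser = mechanicalsoup.StatefulBrowser()
        browser.open("http://httpbin.org/forms/post")
        form = browser.select_form('form[action="/post"]')
        form.set("custname", "Me")                           # text input
        form.set_radio({"size": "medium"})                   # radio button
        form.set_checkbox({"topping": ("bacon", "cheese")})  # checkboxes
        browser.submit_selected()
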
""" def __init__(self, form): if form.name != 'form': warnings.warn( f"Constructed a Form from a '{form.name}' instead of a 'form' " " element. This may be an error in a future version of " "MechanicalSoup.", FutureWarning) self.form = form self._submit_chosen = False # Aliases for backwards compatibility # (Included specifically in __init__ to suppress them in Sphinx docs) self.attach = self.set_input self.input = self.set_input self.textarea = self.set_textarea def set_input(self, data): """Fill-in a set of fields in a form. Example: filling-in a login/password form .. code-block:: python form.set_input({"login": username, "password": password}) This will find the input element named "login" and give it the value ``username``, and the input element named "password" and give it the value ``password``. """ for (name, value) in data.items(): i = self.form.find("input", {"name": name}) if not i: raise InvalidFormMethod("No input field named " + name) self._assert_valid_file_upload(i, value) i["value"] = value def uncheck_all(self, name): """Remove the *checked*-attribute of all input elements with a *name*-attribute given by ``name``. """ for option in self.form.find_all("input", {"name": name}): if "checked" in option.attrs: del option.attrs["checked"] def check(self, data): """For backwards compatibility, this method handles checkboxes and radio buttons in a single call. It will not uncheck any checkboxes unless explicitly specified by ``data``, in contrast with the default behavior of :func:`~Form.set_checkbox`. """ for (name, value) in data.items(): try: self.set_checkbox({name: value}, uncheck_other_boxes=False) continue except InvalidFormMethod: pass try: self.set_radio({name: value}) continue except InvalidFormMethod: pass raise LinkNotFoundError("No input checkbox/radio named " + name) def set_checkbox(self, data, uncheck_other_boxes=True): """Set the *checked*-attribute of input elements of type "checkbox" specified by ``data`` (i.e. check boxes). :param data: Dict of ``{name: value, ...}``. In the family of checkboxes whose *name*-attribute is ``name``, check the box whose *value*-attribute is ``value``. All boxes in the family can be checked (unchecked) if ``value`` is True (False). To check multiple specific boxes, let ``value`` be a tuple or list. :param uncheck_other_boxes: If True (default), before checking any boxes specified by ``data``, uncheck the entire checkbox family. Consider setting to False if some boxes are checked by default when the HTML is served. """ for (name, value) in data.items(): # Case-insensitive search for type=checkbox selector = 'input[type="checkbox" i][name="{}"]'.format(name) checkboxes = self.form.select(selector) if not checkboxes: raise InvalidFormMethod("No input checkbox named " + name) # uncheck if requested if uncheck_other_boxes: self.uncheck_all(name) # Wrap individual values (e.g. int, str) in a 1-element tuple. 
if not isinstance(value, list) and not isinstance(value, tuple): value = (value,) # Check or uncheck one or more boxes for choice in value: choice_str = str(choice) # Allow for example literal numbers for checkbox in checkboxes: if checkbox.attrs.get("value", "on") == choice_str: checkbox["checked"] = "" break # Allow specifying True or False to check/uncheck elif choice is True: checkbox["checked"] = "" break elif choice is False: if "checked" in checkbox.attrs: del checkbox.attrs["checked"] break else: raise LinkNotFoundError( "No input checkbox named %s with choice %s" % (name, choice) ) def set_radio(self, data): """Set the *checked*-attribute of input elements of type "radio" specified by ``data`` (i.e. select radio buttons). :param data: Dict of ``{name: value, ...}``. In the family of radio buttons whose *name*-attribute is ``name``, check the radio button whose *value*-attribute is ``value``. Only one radio button in the family can be checked. """ for (name, value) in data.items(): # Case-insensitive search for type=radio selector = 'input[type="radio" i][name="{}"]'.format(name) radios = self.form.select(selector) if not radios: raise InvalidFormMethod("No input radio named " + name) # only one radio button can be checked self.uncheck_all(name) # Check the appropriate radio button (value cannot be a list/tuple) for radio in radios: if radio.attrs.get("value", "on") == str(value): radio["checked"] = "" break else: raise LinkNotFoundError( f"No input radio named {name} with choice {value}" ) def set_textarea(self, data): """Set the *string*-attribute of the first textarea element specified by ``data`` (i.e. set the text of a textarea). :param data: Dict of ``{name: value, ...}``. The textarea whose *name*-attribute is ``name`` will have its *string*-attribute set to ``value``. """ for (name, value) in data.items(): t = self.form.find("textarea", {"name": name}) if not t: raise InvalidFormMethod("No textarea named " + name) t.string = value def set_select(self, data): """Set the *selected*-attribute of the first option element specified by ``data`` (i.e. select an option from a dropdown). :param data: Dict of ``{name: value, ...}``. Find the select element whose *name*-attribute is ``name``. Then select from among its children the option element whose *value*-attribute is ``value``. If no matching *value*-attribute is found, this will search for an option whose text matches ``value``. If the select element's *multiple*-attribute is set, then ``value`` can be a list or tuple to select multiple options. """ for (name, value) in data.items(): select = self.form.find("select", {"name": name}) if not select: raise InvalidFormMethod("No select named " + name) # Deselect all options first for option in select.find_all("option"): if "selected" in option.attrs: del option.attrs["selected"] # Wrap individual values in a 1-element tuple. # If value is a list/tuple, select must be a ``) will be added using :func:`~Form.new_control`. Example: filling-in a login/password form with EULA checkbox .. code-block:: python form.set("login", username) form.set("password", password) form.set("eula-checkbox", True) Example: uploading a file through a ```` field (provide an open file object, and its content will be uploaded): .. 
code-block:: python form.set("tagname", open(path_to_local_file, "rb")) """ for func in ("checkbox", "radio", "input", "textarea", "select"): try: getattr(self, "set_" + func)({name: value}) return except InvalidFormMethod: pass if force: self.new_control('text', name, value=value) return raise LinkNotFoundError("No valid element named " + name) def new_control(self, type, name, value, **kwargs): """Add a new input element to the form. The arguments set the attributes of the new element. """ # Remove existing input-like elements with the same name for tag in ('input', 'textarea', 'select'): for old in self.form.find_all(tag, {'name': name}): old.decompose() # We don't have access to the original soup object (just the # Tag), so we instantiate a new BeautifulSoup() to call # new_tag(). We're only building the soup object, not parsing # anything, so the parser doesn't matter. Specify the one # included in Python to avoid having dependency issue. control = BeautifulSoup("", "html.parser").new_tag('input') control['type'] = type control['name'] = name control['value'] = value for k, v in kwargs.items(): control[k] = v self._assert_valid_file_upload(control, value) self.form.append(control) return control def choose_submit(self, submit): """Selects the input (or button) element to use for form submission. :param submit: The :class:`bs4.element.Tag` (or just its *name*-attribute) that identifies the submit element to use. If ``None``, will choose the first valid submit element in the form, if one exists. If ``False``, will not use any submit element; this is useful for simulating AJAX requests, for example. To simulate a normal web browser, only one submit element must be sent. Therefore, this does not need to be called if there is only one submit element in the form. If the element is not found or if multiple elements match, raise a :class:`LinkNotFoundError` exception. Example: :: browser = mechanicalsoup.StatefulBrowser() browser.open(url) form = browser.select_form() form.choose_submit('form_name_attr') browser.submit_selected() """ # Since choose_submit is destructive, it doesn't make sense to call # this method twice unless no submit is specified. if self._submit_chosen: if submit is None: return else: raise Exception('Submit already chosen. Cannot change submit!') # All buttons NOT of type (button,reset) are valid submits # Case-insensitive search for type=submit inps = [i for i in self.form.select('input[type="submit" i], button') if i.get("type", "").lower() not in ('button', 'reset')] # If no submit specified, choose the first one if submit is None and inps: submit = inps[0] found = False for inp in inps: if (inp.has_attr('name') and inp['name'] == submit): if found: raise LinkNotFoundError( f"Multiple submit elements match: {submit}" ) found = True elif inp == submit: if found: # Ignore submit element since it is an exact # duplicate of the one we're looking at. del inp['name'] found = True else: # Delete any non-matching element's name so that it will be # omitted from the submitted form data. del inp['name'] if not found and submit is not None and submit is not False: raise LinkNotFoundError( f"Specified submit element not found: {submit}" ) self._submit_chosen = True def print_summary(self): """Print a summary of the form. May help finding which fields need to be filled-in. 
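
        Example (a sketch using a fake page, so no network access is
        needed; the field names are illustrative)::

            browser = mechanicalsoup.StatefulBrowser()
            browser.open_fake_page(
                '<form><input name="login"/><textarea name="bio"></textarea></form>')
            browser.select_form()
            # Prints one line per form control found in the selected form.
            browser.form.print_summary()
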
""" for input in self.form.find_all( ("input", "textarea", "select", "button")): input_copy = copy.copy(input) # Text between the opening tag and the closing tag often # contains a lot of spaces that we don't want here. for subtag in input_copy.find_all() + [input_copy]: if subtag.string: subtag.string = subtag.string.strip() print(input_copy) def _assert_valid_file_upload(self, tag, value): """Raise an exception if a multipart file input is not an open file.""" if ( is_multipart_file_upload(self.form, tag) and not isinstance(value, io.IOBase) ): raise ValueError( "From v1.3.0 onwards, you must pass an open file object " 'directly, e.g. `form["name"] = open("/path/to/file", "rb")`. ' "This change is to remediate a security vulnerability where " "a malicious web server could read arbitrary files from the " "client (CVE-2023-34457)." ) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/mechanicalsoup/stateful_browser.py0000644000175100001720000004125114451070517022754 0ustar00runnerdockerimport re import sys import urllib import bs4 from .browser import Browser from .form import Form from .utils import LinkNotFoundError from requests.structures import CaseInsensitiveDict class _BrowserState: def __init__(self, page=None, url=None, form=None, request=None): self.page = page self.url = url self.form = form self.request = request class StatefulBrowser(Browser): """An extension of :class:`Browser` that stores the browser's state and provides many convenient functions for interacting with HTML elements. It is the primary tool in MechanicalSoup for interfacing with websites. :param session: Attach a pre-existing requests Session instead of constructing a new one. :param soup_config: Configuration passed to BeautifulSoup to affect the way HTML is parsed. Defaults to ``{'features': 'lxml'}``. If overridden, it is highly recommended to `specify a parser `__. Otherwise, BeautifulSoup will issue a warning and pick one for you, but the parser it chooses may be different on different machines. :param requests_adapters: Configuration passed to requests, to affect the way HTTP requests are performed. :param raise_on_404: If True, raise :class:`LinkNotFoundError` when visiting a page triggers a 404 Not Found error. :param user_agent: Set the user agent header to this value. All arguments are forwarded to :func:`Browser`. Examples :: browser = mechanicalsoup.StatefulBrowser( soup_config={'features': 'lxml'}, # Use the lxml HTML parser raise_on_404=True, user_agent='MyBot/0.1: mysite.example.com/bot_info', ) browser.open(url) # ... browser.close() Once not used anymore, the browser can be closed using :func:`~Browser.close`. """ def __init__(self, *args, **kwargs): super().__init__(*args, **kwargs) self.__debug = False self.__verbose = 0 self.__state = _BrowserState() # Aliases for backwards compatibility # (Included specifically in __init__ to suppress them in Sphinx docs) self.get_current_page = lambda: self.page # Almost same as self.form, but don't raise an error if no # form was selected for backward compatibility. self.get_current_form = lambda: self.__state.form self.get_url = lambda: self.url def set_debug(self, debug): """Set the debug mode (off by default). Set to True to enable debug mode. When active, some actions will launch a browser on the current page on failure to let you inspect the page content. 
""" self.__debug = debug def get_debug(self): """Get the debug mode (off by default).""" return self.__debug def set_verbose(self, verbose): """Set the verbosity level (an integer). * 0 means no verbose output. * 1 shows one dot per visited page (looks like a progress bar) * >= 2 shows each visited URL. """ self.__verbose = verbose def get_verbose(self): """Get the verbosity level. See :func:`set_verbose()`.""" return self.__verbose @property def page(self): """Get the current page as a soup object.""" return self.__state.page @property def url(self): """Get the URL of the currently visited page.""" return self.__state.url @property def form(self): """Get the currently selected form as a :class:`Form` object. See :func:`select_form`. """ if self.__state.form is None: raise AttributeError("No form has been selected yet on this page.") return self.__state.form def __setitem__(self, name, value): """Call item assignment on the currently selected form. See :func:`Form.__setitem__`. """ self.form[name] = value def new_control(self, type, name, value, **kwargs): """Call :func:`Form.new_control` on the currently selected form.""" return self.form.new_control(type, name, value, **kwargs) def absolute_url(self, url): """Return the absolute URL made from the current URL and ``url``. The current URL is only used to provide any missing components of ``url``, as in the `.urljoin() method of urllib.parse `__. """ return urllib.parse.urljoin(self.url, url) def open(self, url, *args, **kwargs): """Open the URL and store the Browser's state in this object. All arguments are forwarded to :func:`Browser.get`. :return: Forwarded from :func:`Browser.get`. """ if self.__verbose == 1: sys.stdout.write('.') sys.stdout.flush() elif self.__verbose >= 2: print(url) resp = self.get(url, *args, **kwargs) self.__state = _BrowserState(page=resp.soup, url=resp.url, request=resp.request) return resp def open_fake_page(self, page_text, url=None, soup_config=None): """Mock version of :func:`open`. Behave as if opening a page whose text is ``page_text``, but do not perform any network access. If ``url`` is set, pretend it is the page's URL. Useful mainly for testing. """ soup_config = soup_config or self.soup_config self.__state = _BrowserState( page=bs4.BeautifulSoup(page_text, **soup_config), url=url) def open_relative(self, url, *args, **kwargs): """Like :func:`open`, but ``url`` can be relative to the currently visited page. """ return self.open(self.absolute_url(url), *args, **kwargs) def refresh(self): """Reload the current page with the same request as originally done. Any change (`select_form`, or any value filled-in in the form) made to the current page before refresh is discarded. :raise ValueError: Raised if no refreshable page is loaded, e.g., when using the shallow ``Browser`` wrapper functions. :return: Response of the request.""" old_request = self.__state.request if old_request is None: raise ValueError('The current page is not refreshable. Either no ' 'page is opened or low-level browser methods ' 'were used to do so') resp = self.session.send(old_request) Browser.add_soup(resp, self.soup_config) self.__state = _BrowserState(page=resp.soup, url=resp.url, request=resp.request) return resp def select_form(self, selector="form", nr=0): """Select a form in the current page. :param selector: CSS selector or a bs4.element.Tag object to identify the form to select. If not specified, ``selector`` defaults to "form", which is useful if, e.g., there is only one form on the page. 
For ``selector`` syntax, see the `.select() method in BeautifulSoup `__. :param nr: A zero-based index specifying which form among those that match ``selector`` will be selected. Useful when one or more forms have the same attributes as the form you want to select, and its position on the page is the only way to uniquely identify it. Default is the first matching form (``nr=0``). :return: The selected form as a soup object. It can also be retrieved later with the :attr:`form` attribute. """ def find_associated_elements(form_id): """Find all elements associated to a form (i.e. an element with a form attribute -> ``form=form_id``) """ # Elements which can have a form owner elements_with_owner_form = ("input", "button", "fieldset", "object", "output", "select", "textarea") found_elements = [] for element in elements_with_owner_form: found_elements.extend( self.page.find_all(element, form=form_id) ) return found_elements if isinstance(selector, bs4.element.Tag): if selector.name != "form": raise LinkNotFoundError form = selector else: # nr is a 0-based index for consistency with mechanize found_forms = self.page.select(selector, limit=nr + 1) if len(found_forms) != nr + 1: if self.__debug: print('select_form failed for', selector) self.launch_browser() raise LinkNotFoundError() form = found_forms[-1] if form and form.has_attr('id'): form_id = form["id"] new_elements = find_associated_elements(form_id) form.extend(new_elements) self.__state.form = Form(form) return self.form def _merge_referer(self, **kwargs): """Helper function to set the Referer header in kwargs passed to requests, if it has not already been overridden by the user.""" referer = self.url headers = CaseInsensitiveDict(kwargs.get('headers', {})) if referer is not None and 'Referer' not in headers: headers['Referer'] = referer kwargs['headers'] = headers return kwargs def submit_selected(self, btnName=None, update_state=True, **kwargs): """Submit the form that was selected with :func:`select_form`. :return: Forwarded from :func:`Browser.submit`. :param btnName: Passed to :func:`Form.choose_submit` to choose the element of the current form to use for submission. If ``None``, will choose the first valid submit element in the form, if one exists. If ``False``, will not use any submit element; this is useful for simulating AJAX requests, for example. :param update_state: If False, the form will be submitted but the browser state will remain unchanged; this is useful for forms that result in a download of a file, for example. All other arguments are forwarded to :func:`Browser.submit`. """ self.form.choose_submit(btnName) kwargs = self._merge_referer(**kwargs) resp = self.submit(self.__state.form, url=self.__state.url, **kwargs) if update_state: self.__state = _BrowserState(page=resp.soup, url=resp.url, request=resp.request) return resp def list_links(self, *args, **kwargs): """Display the list of links in the current page. Arguments are forwarded to :func:`links`. """ print("Links in the current page:") for link in self.links(*args, **kwargs): print(" ", link) def links(self, url_regex=None, link_text=None, *args, **kwargs): """Return links in the page, as a list of bs4.element.Tag objects. To return links matching specific criteria, specify ``url_regex`` to match the *href*-attribute, or ``link_text`` to match the *text*-attribute of the Tag. All other arguments are forwarded to the `.find_all() method in BeautifulSoup `__. 
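
        Example (a sketch; the httpbin URL and the regex are illustrative)::

            browser = mechanicalsoup.StatefulBrowser()
            browser.open("http://httpbin.org/links/3/0")
            # All links whose href matches the regular expression "links".
            for link in browser.links(url_regex="links"):
                print(link.text, "->", link["href"])
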
""" all_links = self.page.find_all( 'a', href=True, *args, **kwargs) if url_regex is not None: all_links = [a for a in all_links if re.search(url_regex, a['href'])] if link_text is not None: all_links = [a for a in all_links if a.text == link_text] return all_links def find_link(self, *args, **kwargs): """Find and return a link, as a bs4.element.Tag object. The search can be refined by specifying any argument that is accepted by :func:`links`. If several links match, return the first one found. If no link is found, raise :class:`LinkNotFoundError`. """ links = self.links(*args, **kwargs) if len(links) == 0: raise LinkNotFoundError() else: return links[0] def _find_link_internal(self, link, args, kwargs): """Wrapper around find_link that deals with convenience special-cases: * If ``link`` has an *href*-attribute, then return it. If not, consider it as a ``url_regex`` argument. * If searching for the link fails and debug is active, launch a browser. """ if hasattr(link, 'attrs') and 'href' in link.attrs: return link # Check if "link" parameter should be treated as "url_regex" # but reject obtaining it from both places. if link and 'url_regex' in kwargs: raise ValueError('link parameter cannot be treated as ' 'url_regex because url_regex is already ' 'present in keyword arguments') elif link: kwargs['url_regex'] = link try: return self.find_link(*args, **kwargs) except LinkNotFoundError: if self.get_debug(): print('find_link failed for', kwargs) self.list_links() self.launch_browser() raise def follow_link(self, link=None, *bs4_args, bs4_kwargs={}, requests_kwargs={}, **kwargs): """Follow a link. If ``link`` is a bs4.element.Tag (i.e. from a previous call to :func:`links` or :func:`find_link`), then follow the link. If ``link`` doesn't have a *href*-attribute or is None, treat ``link`` as a url_regex and look it up with :func:`find_link`. ``bs4_kwargs`` are forwarded to :func:`find_link`. For backward compatibility, any excess keyword arguments (aka ``**kwargs``) are also forwarded to :func:`find_link`. If the link is not found, raise :class:`LinkNotFoundError`. Before raising, if debug is activated, list available links in the page and launch a browser. ``requests_kwargs`` are forwarded to :func:`open_relative`. :return: Forwarded from :func:`open_relative`. """ link = self._find_link_internal(link, bs4_args, {**bs4_kwargs, **kwargs}) requests_kwargs = self._merge_referer(**requests_kwargs) return self.open_relative(link['href'], **requests_kwargs) def download_link(self, link=None, file=None, *bs4_args, bs4_kwargs={}, requests_kwargs={}, **kwargs): """Downloads the contents of a link to a file. This function behaves similarly to :func:`follow_link`, but the browser state will not change when calling this function. :param file: Filesystem path where the page contents will be downloaded. If the file already exists, it will be overwritten. Other arguments are the same as :func:`follow_link` (``link`` can either be a bs4.element.Tag or a URL regex. ``bs4_kwargs`` arguments are forwarded to :func:`find_link`, as are any excess keyword arguments (aka ``**kwargs``) for backwards compatibility). :return: `requests.Response `__ object. 
""" link = self._find_link_internal(link, bs4_args, {**bs4_kwargs, **kwargs}) url = self.absolute_url(link['href']) requests_kwargs = self._merge_referer(**requests_kwargs) response = self.session.get(url, **requests_kwargs) if self.raise_on_404 and response.status_code == 404: raise LinkNotFoundError() # Save the response content to file if file is not None: with open(file, 'wb') as f: f.write(response.content) return response def launch_browser(self, soup=None): """Launch a browser to display a page, for debugging purposes. :param: soup: Page contents to display, supplied as a bs4 soup object. Defaults to the current page of the ``StatefulBrowser`` instance. """ if soup is None: soup = self.page super().launch_browser(soup) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/mechanicalsoup/utils.py0000644000175100001720000000130314451070517020514 0ustar00runnerdockerclass LinkNotFoundError(Exception): """Exception raised when mechanicalsoup fails to find something. This happens in situations like (non-exhaustive list): * :func:`~mechanicalsoup.StatefulBrowser.find_link` is called, but no link is found. * The browser was configured with raise_on_404=True and a 404 error is triggered while browsing. * The user tried to fill-in a field which doesn't exist in a form (e.g. browser["name"] = "val" with browser being a StatefulBrowser). """ pass def is_multipart_file_upload(form, tag): return ( form.get("enctype", "") == "multipart/form-data" and tag.get("type", "").lower() == "file" ) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/requirements.txt0000644000175100001720000000005614451070517017277 0ustar00runnerdockerrequests >= 2.22.0 beautifulsoup4 >= 4.7 lxml ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7319238 MechanicalSoup-1.3.0/setup.cfg0000644000175100001720000000042414451070531015627 0ustar00runnerdocker[aliases] test = pytest [tool:pytest] addopts = --cov --cov-config .coveragerc --flake8 -v flake8-ignore = docs/*.py ALL python_files = tests/*.py [build_sphinx] source-dir = docs/ build-dir = docs/_build all-files = 1 fresh-env = 1 [egg_info] tag_build = tag_date = 0 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/setup.py0000644000175100001720000000556114451070517015533 0ustar00runnerdockerimport re import sys from codecs import open # To use a consistent encoding from os import path from setuptools import setup # Always prefer setuptools over distutils def requirements_from_file(filename): """Parses a pip requirements file into a list.""" with open(filename, 'r') as fd: return [line.strip() for line in fd if line.strip() and not line.strip().startswith('--')] def read(fname, URL, URLImage): """Read the content of a file.""" with open(path.join(path.dirname(__file__), fname)) as fd: readme = fd.read() if hasattr(readme, 'decode'): # In Python 3, turn bytes into str. readme = readme.decode('utf8') # turn relative links into absolute ones readme = re.sub(r'`<([^>]*)>`__', r'`\1 <' + URL + r"/blob/main/\1>`__", readme) readme = re.sub(r"\.\. image:: /", ".. image:: " + URLImage + "/", readme) return readme here = path.abspath(path.dirname(__file__)) about = {} with open(path.join(here, 'mechanicalsoup', '__version__.py'), 'r', 'utf-8') as fd: exec(fd.read(), about) # Don't install pytest-runner on every setup.py run, just for tests. 
# See https://pypi.org/project/pytest-runner/#conditional-requirement needs_pytest = {'pytest', 'test', 'ptr'}.intersection(sys.argv) pytest_runner = ['pytest-runner'] if needs_pytest else [] setup( name=about['__title__'], # useful: python setup.py sdist bdist_wheel upload version=about['__version__'], description=about['__description__'], long_description=read('README.rst', about['__github_url__'], about[ '__github_assets_absoluteURL__']), url=about['__url__'], project_urls={ 'Source': about['__github_url__'], }, license=about['__license__'], python_requires='>=3.6', classifiers=[ 'License :: OSI Approved :: MIT License', # Specify the Python versions you support here. In particular, ensure # that you indicate whether you support Python 2, Python 3 or both. 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.6', 'Programming Language :: Python :: 3.7', 'Programming Language :: Python :: 3.8', 'Programming Language :: Python :: 3.9', 'Programming Language :: Python :: 3.10', 'Programming Language :: Python :: 3.11', 'Programming Language :: Python :: 3 :: Only', ], packages=['mechanicalsoup'], # List run-time dependencies here. These will be installed by pip # when your project is installed. For an analysis of # "install_requires" vs pip's requirements files see: # https://packaging.python.org/en/latest/requirements.html install_requires=requirements_from_file('requirements.txt'), setup_requires=pytest_runner, tests_require=requirements_from_file('tests/requirements.txt'), ) ././@PaxHeader0000000000000000000000000000003400000000000010212 xustar0028 mtime=1688498520.7319238 MechanicalSoup-1.3.0/tests/0000755000175100001720000000000014451070531015150 5ustar00runnerdocker././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/tests/requirements.txt0000644000175100001720000000017314451070517020441 0ustar00runnerdockerflake8 < 5.0.0 pytest >= 3.1.0 pytest-cov pytest-flake8 pytest-httpbin pytest-mock requests_mock >= 1.3.0 werkzeug < 2.1.0 ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/tests/setpath.py0000644000175100001720000000037714451070517017205 0ustar00runnerdocker"""Add the main directory of the project to sys.path, so that uninstalled version is tested.""" import os import sys TEST_DIR = os.path.abspath(os.path.dirname(__file__)) PROJ_DIR = os.path.dirname(TEST_DIR) sys.path.insert(0, os.path.join(PROJ_DIR)) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/tests/test_browser.py0000644000175100001720000002740714451070517020262 0ustar00runnerdockerimport os import sys import tempfile import pytest import setpath # noqa:F401, must come before 'import mechanicalsoup' from bs4 import BeautifulSoup from requests.cookies import RequestsCookieJar from utils import mock_get, prepare_mock_browser import mechanicalsoup def test_submit_online(httpbin): """Complete and submit the pizza form at http://httpbin.org/forms/post """ browser = mechanicalsoup.Browser() page = browser.get(httpbin + "/forms/post") form = page.soup.form form.find("input", {"name": "custname"})["value"] = "Philip J. 
Fry" # leave custtel blank without value assert "value" not in form.find("input", {"name": "custtel"}).attrs form.find("input", {"name": "size", "value": "medium"})["checked"] = "" form.find("input", {"name": "topping", "value": "cheese"})["checked"] = "" form.find("input", {"name": "topping", "value": "onion"})["checked"] = "" form.find("textarea", {"name": "comments"}).insert(0, "freezer") response = browser.submit(form, page.url) # helpfully the form submits to http://httpbin.org/post which simply # returns the request headers in json format json = response.json() data = json["form"] assert data["custname"] == "Philip J. Fry" assert data["custtel"] == "" # web browser submits "" for input left blank assert data["size"] == "medium" assert data["topping"] == ["cheese", "onion"] assert data["comments"] == "freezer" assert json["headers"]["User-Agent"].startswith('python-requests/') assert 'MechanicalSoup' in json["headers"]["User-Agent"] def test_get_request_kwargs(httpbin): """Return kwargs without a submit""" browser = mechanicalsoup.Browser() page = browser.get(httpbin + "/forms/post") form = page.soup.form form.find("input", {"name": "custname"})["value"] = "Philip J. Fry" request_kwargs = browser.get_request_kwargs(form, page.url) assert "method" in request_kwargs assert "url" in request_kwargs assert "data" in request_kwargs assert ("custname", "Philip J. Fry") in request_kwargs["data"] def test_get_request_kwargs_when_method_is_in_kwargs(httpbin): """Raise TypeError exception""" browser = mechanicalsoup.Browser() page = browser.get(httpbin + "/forms/post") form = page.soup.form kwargs = {"method": "post"} with pytest.raises(TypeError): browser.get_request_kwargs(form, page.url, **kwargs) def test_get_request_kwargs_when_url_is_in_kwargs(httpbin): """Raise TypeError exception""" browser = mechanicalsoup.Browser() page = browser.get(httpbin + "/forms/post") form = page.soup.form kwargs = {"url": httpbin + "/forms/post"} with pytest.raises(TypeError): # pylint: disable=redundant-keyword-arg browser.get_request_kwargs(form, page.url, **kwargs) def test__request(httpbin): form_html = f"""
    <form method="post" action="{httpbin.url}/post">
      <input name="customer" value="Philip J. Fry"/>
      <input name="telephone" value="555"/>
      <textarea name="comments">freezer</textarea>
      <fieldset>
        <legend> Pizza Size </legend>
        <p><input type="radio" name="size" value="small"/> Small</p>
        <p><input type="radio" name="size" value="medium" checked=""/> Medium</p>
        <p><input type="radio" name="size" value="large"/> Large</p>
      </fieldset>
      <fieldset>
        <legend> Pizza Toppings </legend>
        <p><input type="checkbox" name="topping" value="bacon" checked=""/> Bacon</p>
        <p><input type="checkbox" name="topping" value="cheese"/> Extra Cheese</p>
        <p><input type="checkbox" name="topping" value="onion" checked=""/> Onion</p>
        <p><input type="checkbox" name="topping" value="mushroom"/> Mushroom</p>
      </fieldset>
      <select name="shape">
        <option value="round">Round</option>
        <option value="square" selected="">Square</option>
      </select>
    </form>
""" form = BeautifulSoup(form_html, "lxml").form browser = mechanicalsoup.Browser() response = browser._request(form) data = response.json()['form'] assert data["customer"] == "Philip J. Fry" assert data["telephone"] == "555" assert data["comments"] == "freezer" assert data["size"] == "medium" assert data["topping"] == ["bacon", "onion"] assert data["shape"] == "square" assert "application/x-www-form-urlencoded" in response.request.headers[ "Content-Type"] valid_enctypes_file_submit = {"multipart/form-data": True, "application/x-www-form-urlencoded": False } default_enctype = "application/x-www-form-urlencoded" @pytest.mark.parametrize("file_field", [ """""", ""]) @pytest.mark.parametrize("submit_file", [ True, False ]) @pytest.mark.parametrize("enctype", [ pytest.param("multipart/form-data"), pytest.param("application/x-www-form-urlencoded"), pytest.param("Invalid enctype") ]) def test_enctype_and_file_submit(httpbin, enctype, submit_file, file_field): # test if enctype is respected when specified # and if files are processed correctly form_html = f"""
{file_field}
""" form = BeautifulSoup(form_html, "lxml").form valid_enctype = (enctype in valid_enctypes_file_submit and valid_enctypes_file_submit[enctype]) expected_content = b"" # default if submit_file and file_field: # create a temporary file for testing file upload file_content = b":-)" pic_filedescriptor, pic_path = tempfile.mkstemp() pic_filename = os.path.basename(pic_path) os.write(pic_filedescriptor, file_content) os.close(pic_filedescriptor) if valid_enctype: # Correct encoding => send the content expected_content = file_content else: # Encoding doesn't allow sending the content, we expect # the filename as a normal text field. expected_content = os.path.basename(pic_path.encode()) tag = form.find("input", {"name": "pic"}) tag["value"] = open(pic_path, "rb") browser = mechanicalsoup.Browser() response = browser._request(form) if enctype not in valid_enctypes_file_submit: expected_enctype = default_enctype else: expected_enctype = enctype assert expected_enctype in response.request.headers["Content-Type"] resp = response.json() assert resp["form"]["in"] == "test" found = False found_in = None for key, value in resp.items(): if value: if "pic" in value: content = value["pic"].encode() assert not found assert key in ("files", "form") found = True found_in = key if key == "files" and not valid_enctype: assert not value assert found == bool(file_field) if file_field: assert content == expected_content if valid_enctype: assert found_in == "files" if submit_file: assert ("filename=\"" + pic_filename + "\"" ).encode() in response.request.body else: assert b"filename=\"\"" in response.request.body else: assert found_in == "form" if submit_file and file_field: os.remove(pic_path) def test__request_select_none(httpbin): """Make sure that a """ form = BeautifulSoup(form_html, "lxml").form browser = mechanicalsoup.Browser() response = browser._request(form) assert response.json()['form'] == {'shape': 'round'} def test__request_disabled_attr(httpbin): """Make sure that disabled form controls are not submitted.""" form_html = f"""
""" browser = mechanicalsoup.Browser() response = browser._request(BeautifulSoup(form_html, "lxml").form) assert response.json()['form'] == {} @pytest.mark.parametrize("keyword", [ pytest.param('method'), pytest.param('url'), ]) def test_request_keyword_error(keyword): """Make sure exception is raised if kwargs duplicates an arg.""" form_html = "
" browser = mechanicalsoup.Browser() with pytest.raises(TypeError, match="multiple values for"): browser._request(BeautifulSoup(form_html, "lxml").form, 'myurl', **{keyword: 'somevalue'}) def test_no_404(httpbin): browser = mechanicalsoup.Browser() resp = browser.get(httpbin + "/nosuchpage") assert resp.status_code == 404 def test_404(httpbin): browser = mechanicalsoup.Browser(raise_on_404=True) with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.get(httpbin + "/nosuchpage") resp = browser.get(httpbin.url) assert resp.status_code == 200 def test_set_cookiejar(httpbin): """Set cookies locally and test that they are received remotely.""" # construct a phony cookiejar and attach it to the session jar = RequestsCookieJar() jar.set('field', 'value') assert jar.get('field') == 'value' browser = mechanicalsoup.Browser() browser.set_cookiejar(jar) resp = browser.get(httpbin + "/cookies") assert resp.json() == {'cookies': {'field': 'value'}} def test_get_cookiejar(httpbin): """Test that cookies set by the remote host update our session.""" browser = mechanicalsoup.Browser() resp = browser.get(httpbin + "/cookies/set?k1=v1&k2=v2") assert resp.json() == {'cookies': {'k1': 'v1', 'k2': 'v2'}} jar = browser.get_cookiejar() assert jar.get('k1') == 'v1' assert jar.get('k2') == 'v2' def test_post(httpbin): browser = mechanicalsoup.Browser() data = {'color': 'blue', 'colorblind': 'True'} resp = browser.post(httpbin + "/post", data) assert resp.status_code == 200 and resp.json()['form'] == data def test_put(httpbin): browser = mechanicalsoup.Browser() data = {'color': 'blue', 'colorblind': 'True'} resp = browser.put(httpbin + "/put", data) assert resp.status_code == 200 and resp.json()['form'] == data @pytest.mark.parametrize("http_html_expected_encoding", [ pytest.param((None, 'utf-8', 'utf-8')), pytest.param(('utf-8', 'utf-8', 'utf-8')), pytest.param(('utf-8', None, 'utf-8')), pytest.param(('utf-8', 'ISO-8859-1', 'utf-8')), ]) def test_encoding(httpbin, http_html_expected_encoding): http_encoding = http_html_expected_encoding[0] html_encoding = http_html_expected_encoding[1] expected_encoding = http_html_expected_encoding[2] url = 'mock://encoding' text = ( '' + '' + ( ( 'Titleéàè' ) if html_encoding else '' ) + '' + '' ) browser, adapter = prepare_mock_browser() mock_get( adapter, url=url, reply=( text.encode(http_encoding) if http_encoding else text.encode("utf-8") ), content_type=( 'text/html' + ( ';charset=' + http_encoding if http_encoding else '' ) ) ) browser.open(url) assert browser.page.original_encoding == expected_encoding if __name__ == '__main__': pytest.main(sys.argv) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/tests/test_form.py0000644000175100001720000004634514451070517017544 0ustar00runnerdockerimport sys import bs4 import pytest import setpath # noqa:F401, must come before 'import mechanicalsoup' from utils import setup_mock_browser import mechanicalsoup def test_construct_form_fail(): """Form objects must be constructed from form html elements.""" soup = bs4.BeautifulSoup('This is not a form', 'lxml') tag = soup.find('notform') assert isinstance(tag, bs4.element.Tag) with pytest.warns(FutureWarning, match="from a 'notform'"): mechanicalsoup.Form(tag) def test_submit_online(httpbin): """Complete and submit the pizza form at http://httpbin.org/forms/post """ browser = mechanicalsoup.Browser() page = browser.get(httpbin + "/forms/post") form = mechanicalsoup.Form(page.soup.form) input_data = {"custname": 
"Philip J. Fry"} form.input(input_data) check_data = {"size": "large", "topping": ["cheese"]} form.check(check_data) check_data = {"size": "medium", "topping": "onion"} form.check(check_data) form.textarea({"comments": "warm"}) form.textarea({"comments": "actually, no, not warm"}) form.textarea({"comments": "freezer"}) response = browser.submit(form, page.url) # helpfully the form submits to http://httpbin.org/post which simply # returns the request headers in json format json = response.json() data = json["form"] assert data["custname"] == "Philip J. Fry" assert data["custtel"] == "" # web browser submits "" for input left blank assert data["size"] == "medium" assert data["topping"] == ["cheese", "onion"] assert data["comments"] == "freezer" def test_submit_set(httpbin): """Complete and submit the pizza form at http://httpbin.org/forms/post """ browser = mechanicalsoup.Browser() page = browser.get(httpbin + "/forms/post") form = mechanicalsoup.Form(page.soup.form) form["custname"] = "Philip J. Fry" form["size"] = "medium" form["topping"] = ("cheese", "onion") form["comments"] = "freezer" response = browser.submit(form, page.url) # helpfully the form submits to http://httpbin.org/post which simply # returns the request headers in json format json = response.json() data = json["form"] assert data["custname"] == "Philip J. Fry" assert data["custtel"] == "" # web browser submits "" for input left blank assert data["size"] == "medium" assert data["topping"] == ["cheese", "onion"] assert data["comments"] == "freezer" @pytest.mark.parametrize("expected_post", [ pytest.param( [ ('text', 'Setting some text!'), ('comment', 'Testing preview page'), ('preview', 'Preview Page'), ], id='preview'), pytest.param( [ ('text', '= Heading =\n\nNew page here!\n'), ('comment', 'Created new page'), ('save', 'Submit changes'), ], id='save'), pytest.param( [ ('text', '= Heading =\n\nNew page here!\n'), ('comment', 'Testing choosing cancel button'), ('cancel', 'Cancel'), ], id='cancel'), ]) def test_choose_submit(expected_post): browser, url = setup_mock_browser(expected_post=expected_post) browser.open(url) form = browser.select_form('#choose-submit-form') browser['text'] = dict(expected_post)['text'] browser['comment'] = dict(expected_post)['comment'] form.choose_submit(expected_post[2][0]) res = browser.submit_selected() assert res.status_code == 200 and res.text == 'Success!' @pytest.mark.parametrize("value", [ pytest.param('continue', id='first'), pytest.param('cancel', id='second'), ]) def test_choose_submit_from_selector(value): """Test choose_submit by passing a CSS selector argument.""" text = """
""" browser, url = setup_mock_browser(expected_post=[('do', value)], text=text) browser.open(url) form = browser.select_form() submits = form.form.select(f'input[value="{value}"]') assert len(submits) == 1 form.choose_submit(submits[0]) res = browser.submit_selected() assert res.status_code == 200 and res.text == 'Success!' choose_submit_fail_form = '''
''' @pytest.mark.parametrize("select_name", [ pytest.param({'name': 'does_not_exist', 'fails': True}, id='not found'), pytest.param({'name': 'test_submit', 'fails': False}, id='found'), ]) def test_choose_submit_fail(select_name): browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page(choose_submit_fail_form) form = browser.select_form('#choose-submit-form') if select_name['fails']: with pytest.raises(mechanicalsoup.utils.LinkNotFoundError): form.choose_submit(select_name['name']) else: form.choose_submit(select_name['name']) def test_choose_submit_twice(): """Test that calling choose_submit twice fails.""" text = '''
''' soup = bs4.BeautifulSoup(text, 'lxml') form = mechanicalsoup.Form(soup.form) form.choose_submit('test1') expected_msg = 'Submit already chosen. Cannot change submit!' with pytest.raises(Exception, match=expected_msg): form.choose_submit('test2') choose_submit_multiple_match_form = '''
''' def test_choose_submit_multiple_match(): browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page(choose_submit_multiple_match_form) form = browser.select_form('#choose-submit-form') with pytest.raises(mechanicalsoup.utils.LinkNotFoundError): form.choose_submit('test_submit') submit_form_noaction = '''
''' def test_form_noaction(): browser, url = setup_mock_browser() browser.open_fake_page(submit_form_noaction, url=url) form = browser.select_form('#choose-submit-form') form['text1'] = 'newText1' res = browser.submit_selected() assert res.status_code == 200 and browser.url == url submit_form_action = '''
''' def test_form_action(): browser, url = setup_mock_browser() # for info about example.com see: https://tools.ietf.org/html/rfc2606 browser.open_fake_page(submit_form_action, url="http://example.com/invalid/") form = browser.select_form('#choose-submit-form') form['text1'] = 'newText1' res = browser.submit_selected() assert res.status_code == 200 and browser.url == url set_select_form = '''
''' @pytest.mark.parametrize("option", [ pytest.param({'result': [('entree', 'tofu')], 'default': True}, id='default'), pytest.param({'result': [('entree', 'curry')], 'default': False}, id='selected'), ]) def test_set_select(option): '''Test the branch of Form.set that finds "select" elements.''' browser, url = setup_mock_browser(expected_post=option['result'], text=set_select_form) browser.open(url) browser.select_form('form') if not option['default']: browser[option['result'][0][0]] = option['result'][0][1] res = browser.submit_selected() assert res.status_code == 200 and res.text == 'Success!' set_select_multiple_form = '''
''' @pytest.mark.parametrize("options", [ pytest.param('bass', id='select one (str)'), pytest.param(('bass',), id='select one (tuple)'), pytest.param(('piano', 'violin'), id='select two'), ]) def test_set_select_multiple(options): """Test a or # . Normalize before comparing. out = out.replace('>', '/>') assert out == """ """ assert err == "" page_with_radio = '''
<form method="post">
  <input type="checkbox" name="foo" value="bar"/> This is a checkbox
</form>
''' def test_form_check_uncheck(): browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page(page_with_radio, url="http://example.com/invalid/") form = browser.select_form('form') assert "checked" not in form.form.find("input", {"name": "foo"}).attrs form["foo"] = True assert form.form.find("input", {"name": "foo"}).attrs["checked"] == "" # Test explicit unchecking (skipping the call to Form.uncheck_all) form.set_checkbox({"foo": False}, uncheck_other_boxes=False) assert "checked" not in form.form.find("input", {"name": "foo"}).attrs page_with_various_fields = '''
Pizza Toppings

Small

Medium

Large

''' def test_form_print_summary(capsys): browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page(page_with_various_fields, url="http://example.com/invalid/") browser.select_form("form") browser.form.print_summary() out, err = capsys.readouterr() # Different versions of bs4 show either or # . Normalize before comparing. out = out.replace('>', '/>') assert out == """ """ assert err == "" def test_issue180(): """Test that a KeyError is not raised when Form.choose_submit is called on a form where a submit element is missing its name-attribute.""" browser = mechanicalsoup.StatefulBrowser() html = '''
''' browser.open_fake_page(html) form = browser.select_form() with pytest.raises(mechanicalsoup.utils.LinkNotFoundError): form.choose_submit('not_found') def test_issue158(): """Test that form elements are processed in their order on the page and that elements with duplicate name-attributes are not clobbered.""" issue158_form = '''
''' expected_post = [('box', '1'), ('box', '2'), ('box', '0')] browser, url = setup_mock_browser(expected_post=expected_post, text=issue158_form) browser.open(url) browser.select_form() res = browser.submit_selected() assert res.status_code == 200 and res.text == 'Success!' browser.close() def test_duplicate_submit_buttons(): """Tests that duplicate submits doesn't break form submissions See issue https://github.com/MechanicalSoup/MechanicalSoup/issues/264""" issue264_form = '''
''' expected_post = [('box', '1'), ('search', 'Search')] browser, url = setup_mock_browser(expected_post=expected_post, text=issue264_form) browser.open(url) browser.select_form() res = browser.submit_selected() assert res.status_code == 200 and res.text == 'Success!' browser.close() @pytest.mark.parametrize("expected_post", [ pytest.param([('sub2', 'val2')], id='submit button'), pytest.param([('sub4', 'val4')], id='typeless button'), pytest.param([('sub5', 'val5')], id='submit input'), ]) def test_choose_submit_buttons(expected_post): """Buttons of type reset and button are not valid submits""" text = """
""" browser, url = setup_mock_browser(expected_post=expected_post, text=text) browser.open(url) browser.select_form() res = browser.submit_selected(btnName=expected_post[0][0]) assert res.status_code == 200 and res.text == 'Success!' @pytest.mark.parametrize("fail, selected, expected_post", [ pytest.param(False, 'with_value', [('selector', 'with_value')], id='Option with value'), pytest.param(False, 'Without value', [('selector', 'Without value')], id='Option without value'), pytest.param(False, 'We have a value here', [('selector', 'with_value')], id='Option with value selected by its text'), pytest.param(True, 'Unknown option', None, id='Unknown option, must raise a LinkNotFound exception') ]) def test_option_without_value(fail, selected, expected_post): """Option tag in select can have no value option""" text = """
""" browser, url = setup_mock_browser(expected_post=expected_post, text=text) browser.open(url) browser.select_form() if fail: with pytest.raises(mechanicalsoup.utils.LinkNotFoundError): browser['selector'] = selected else: browser['selector'] = selected res = browser.submit_selected() assert res.status_code == 200 and res.text == 'Success!' if __name__ == '__main__': pytest.main(sys.argv) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/tests/test_stateful_browser.py0000644000175100001720000007604714451070517022175 0ustar00runnerdockerimport copy import json import os import re import sys import tempfile import webbrowser import pytest import setpath # noqa:F401, must come before 'import mechanicalsoup' from bs4 import BeautifulSoup from utils import (mock_get, open_legacy_httpbin, prepare_mock_browser, setup_mock_browser) import mechanicalsoup import requests def test_request_forward(): data = [('var1', 'val1'), ('var2', 'val2')] browser, url = setup_mock_browser(expected_post=data) r = browser.request('POST', url + '/post', data=data) assert r.text == 'Success!' def test_properties(): """Check that properties return the same value as the getter.""" browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page('
', url="http://example.com") assert browser.page == browser.get_current_page() assert browser.page is not None assert browser.url == browser.get_url() assert browser.url is not None browser.select_form() assert browser.form == browser.get_current_form() assert browser.form is not None def test_get_selected_form_unselected(): browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page('
') with pytest.raises(AttributeError, match="No form has been selected yet."): browser.form assert browser.get_current_form() is None def test_submit_online(httpbin): """Complete and submit the pizza form at http://httpbin.org/forms/post """ browser = mechanicalsoup.StatefulBrowser() browser.set_user_agent('testing MechanicalSoup') browser.open(httpbin.url) for link in browser.links(): if link["href"] == "/": browser.follow_link(link) break browser.follow_link("forms/post") assert browser.url == httpbin + "/forms/post" browser.select_form("form") browser["custname"] = "Customer Name Here" browser["size"] = "medium" browser["topping"] = ("cheese", "bacon") # Change our mind to make sure old boxes are unticked browser["topping"] = ("cheese", "onion") browser["comments"] = "Some comment here" browser.form.set("nosuchfield", "new value", True) response = browser.submit_selected() json = response.json() data = json["form"] assert data["custname"] == "Customer Name Here" assert data["custtel"] == "" # web browser submits "" for input left blank assert data["size"] == "medium" assert set(data["topping"]) == {"cheese", "onion"} assert data["comments"] == "Some comment here" assert data["nosuchfield"] == "new value" assert json["headers"]["User-Agent"] == 'testing MechanicalSoup' # Ensure we haven't blown away any regular headers expected_headers = ('Content-Length', 'Host', 'Content-Type', 'Connection', 'Accept', 'User-Agent', 'Accept-Encoding') assert set(expected_headers).issubset(json["headers"].keys()) def test_no_404(httpbin): browser = mechanicalsoup.StatefulBrowser() resp = browser.open(httpbin + "/nosuchpage") assert resp.status_code == 404 def test_404(httpbin): browser = mechanicalsoup.StatefulBrowser(raise_on_404=True) with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.open(httpbin + "/nosuchpage") resp = browser.open(httpbin.url) assert resp.status_code == 200 def test_user_agent(httpbin): browser = mechanicalsoup.StatefulBrowser(user_agent='007') resp = browser.open(httpbin + "/user-agent") assert resp.json() == {'user-agent': '007'} def test_open_relative(httpbin): # Open an arbitrary httpbin page to set the current URL browser = mechanicalsoup.StatefulBrowser() browser.open(httpbin + "/html") # Open a relative page and make sure remote host and browser agree on URL resp = browser.open_relative("/get") assert resp.json()['url'] == httpbin + "/get" assert browser.url == httpbin + "/get" # Test passing additional kwargs to the session resp = browser.open_relative("/basic-auth/me/123", auth=('me', '123')) assert browser.url == httpbin + "/basic-auth/me/123" assert resp.json() == {"authenticated": True, "user": "me"} def test_links(): browser = mechanicalsoup.StatefulBrowser() html = '''A Blue Link A Red Link''' expected = [BeautifulSoup(html, "lxml").a] browser.open_fake_page(html) # Test StatefulBrowser.links url_regex argument assert browser.links(url_regex="bl") == expected assert browser.links(url_regex="bluish") == [] # Test StatefulBrowser.links link_text argument assert browser.links(link_text="A Blue Link") == expected assert browser.links(link_text="Blue") == [] # Test StatefulBrowser.links kwargs passed to BeautifulSoup.find_all assert browser.links(string=re.compile('Blue')) == expected assert browser.links(class_="bluelink") == expected assert browser.links(id="blue_link") == expected assert browser.links(id="blue") == [] # Test returning a non-singleton two_links = browser.links(id=re.compile('_link')) assert len(two_links) == 2 assert two_links == 
BeautifulSoup(html, "lxml").find_all('a') @pytest.mark.parametrize("expected_post", [ pytest.param( [ ('text', 'Setting some text!'), ('comment', 'Selecting an input submit'), ('diff', 'Review Changes'), ], id='input'), pytest.param( [ ('text', '= Heading =\n\nNew page here!\n'), ('comment', 'Selecting a button submit'), ('cancel', 'Cancel'), ], id='button'), ]) def test_submit_btnName(expected_post): '''Tests that the btnName argument chooses the submit button.''' browser, url = setup_mock_browser(expected_post=expected_post) browser.open(url) browser.select_form('#choose-submit-form') browser['text'] = dict(expected_post)['text'] browser['comment'] = dict(expected_post)['comment'] initial_state = browser._StatefulBrowser__state res = browser.submit_selected(btnName=expected_post[2][0]) assert res.status_code == 200 and res.text == 'Success!' assert initial_state != browser._StatefulBrowser__state @pytest.mark.parametrize("expected_post", [ pytest.param( [ ('text', 'Setting some text!'), ('comment', 'Selecting an input submit'), ], id='input'), pytest.param( [ ('text', '= Heading =\n\nNew page here!\n'), ('comment', 'Selecting a button submit'), ], id='button'), ]) def test_submit_no_btn(expected_post): '''Tests that no submit inputs are posted when btnName=False.''' browser, url = setup_mock_browser(expected_post=expected_post) browser.open(url) browser.select_form('#choose-submit-form') browser['text'] = dict(expected_post)['text'] browser['comment'] = dict(expected_post)['comment'] initial_state = browser._StatefulBrowser__state res = browser.submit_selected(btnName=False) assert res.status_code == 200 and res.text == 'Success!' assert initial_state != browser._StatefulBrowser__state def test_submit_dont_modify_kwargs(): """Test that submit_selected() doesn't modify the caller's passed-in kwargs, for example when adding a Referer header. """ kwargs = {'headers': {'Content-Type': 'text/html'}} saved_kwargs = copy.deepcopy(kwargs) browser, url = setup_mock_browser(expected_post=[], text='
') browser.open(url) browser.select_form() browser.submit_selected(**kwargs) assert kwargs == saved_kwargs def test_submit_dont_update_state(): expected_post = [ ('text', 'Bananas are good.'), ('preview', 'Preview Page')] browser, url = setup_mock_browser(expected_post=expected_post) browser.open(url) browser.select_form('#choose-submit-form') browser['text'] = dict(expected_post)['text'] initial_state = browser._StatefulBrowser__state res = browser.submit_selected(update_state=False) assert res.status_code == 200 and res.text == 'Success!' assert initial_state == browser._StatefulBrowser__state def test_get_set_debug(): browser = mechanicalsoup.StatefulBrowser() # Debug mode is off by default assert not browser.get_debug() browser.set_debug(True) assert browser.get_debug() def test_list_links(capsys): # capsys is a pytest fixture that allows us to inspect the std{err,out} browser = mechanicalsoup.StatefulBrowser() links = ''' Link #1 Link #2 ''' browser.open_fake_page(f'{links}') browser.list_links() out, err = capsys.readouterr() expected = f'Links in the current page:{links}' assert out == expected def test_launch_browser(mocker): browser = mechanicalsoup.StatefulBrowser() browser.set_debug(True) browser.open_fake_page('') mocker.patch('webbrowser.open') with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.follow_link('nosuchlink') # mock.assert_called_once() not available on some versions :-( assert webbrowser.open.call_count == 1 mocker.resetall() with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.select_form('nosuchlink') # mock.assert_called_once() not available on some versions :-( assert webbrowser.open.call_count == 1 def test_find_link(): browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page('') with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.find_link('nosuchlink') def test_verbose(capsys): '''Tests that the btnName argument chooses the submit button.''' browser, url = setup_mock_browser() browser.open(url) out, err = capsys.readouterr() assert out == "" assert err == "" assert browser.get_verbose() == 0 browser.set_verbose(1) browser.open(url) out, err = capsys.readouterr() assert out == "." assert err == "" assert browser.get_verbose() == 1 browser.set_verbose(2) browser.open(url) out, err = capsys.readouterr() assert out == "mock://form.com\n" assert err == "" assert browser.get_verbose() == 2 def test_new_control(httpbin): browser = mechanicalsoup.StatefulBrowser() browser.open(httpbin + "/forms/post") browser.select_form("form") with pytest.raises(mechanicalsoup.LinkNotFoundError): # The control doesn't exist, yet. browser["temperature"] = "cold" browser["size"] = "large" # Existing radio browser["comments"] = "This is a comment" # Existing textarea browser.new_control("text", "temperature", "warm") browser.new_control("textarea", "size", "Sooo big !") browser.new_control("text", "comments", "This is an override comment") fake_select = BeautifulSoup("", "html.parser").new_tag('select') fake_select["name"] = "foo" browser.form.form.append(fake_select) browser.new_control("checkbox", "foo", "valval", checked="checked") tag = browser.form.form.find("input", {"name": "foo"}) assert tag.attrs["checked"] == "checked" browser["temperature"] = "hot" response = browser.submit_selected() json = response.json() data = json["form"] print(data) assert data["temperature"] == "hot" assert data["size"] == "Sooo big !" assert data["comments"] == "This is an override comment" assert data["foo"] == "valval" submit_form_noaction = '''
''' def test_form_noaction(): browser, url = setup_mock_browser() browser.open_fake_page(submit_form_noaction) browser.select_form('#choose-submit-form') with pytest.raises(ValueError, match="no URL to submit to"): browser.submit_selected() submit_form_noname = '''
''' def test_form_noname(): browser, url = setup_mock_browser(expected_post=[]) browser.open_fake_page(submit_form_noname, url=url) browser.select_form('#choose-submit-form') response = browser.submit_selected() assert response.status_code == 200 and response.text == 'Success!' submit_form_multiple = '''
''' def test_form_multiple(): browser, url = setup_mock_browser(expected_post=[('foo', 'tofu'), ('foo', 'tempeh')]) browser.open_fake_page(submit_form_multiple, url=url) browser.select_form('#choose-submit-form') response = browser.submit_selected() assert response.status_code == 200 and response.text == 'Success!' def test_upload_file(httpbin): browser = mechanicalsoup.StatefulBrowser() url = httpbin + "/post" file_input_form = f"""
""" # Create two temporary files to upload def make_file(content): path = tempfile.mkstemp()[1] with open(path, "w") as fd: fd.write(content) return path path1 = make_file("first file content") path2 = make_file("second file content") value1 = open(path1, "rb") value2 = open(path2, "rb") browser.open_fake_page(file_input_form) browser.select_form() # Test filling an existing input and creating a new input browser["first"] = value1 browser.new_control("file", "second", value2) response = browser.submit_selected() files = response.json()["files"] assert files["first"] == "first file content" assert files["second"] == "second file content" def test_upload_file_with_malicious_default(httpbin): """Check for CVE-2023-34457 by setting the form input value directly to a file that the user does not explicitly consent to upload, as a malicious server might do. """ browser = mechanicalsoup.StatefulBrowser() sensitive_path = tempfile.mkstemp()[1] with open(sensitive_path, "w") as fd: fd.write("Some sensitive information") url = httpbin + "/post" malicious_html = f"""
""" browser.open_fake_page(malicious_html) browser.select_form() response = browser.submit_selected() assert response.json()["files"] == {"malicious": ""} def test_upload_file_raise_on_string_input(): """Check for use of the file upload API that was modified to remediate CVE-2023-34457. Users must now open files manually to upload them. """ browser = mechanicalsoup.StatefulBrowser() file_input_form = """
""" browser.open_fake_page(file_input_form) browser.select_form() with pytest.raises(ValueError, match="CVE-2023-34457"): browser["upload"] = "/path/to/file" with pytest.raises(ValueError, match="CVE-2023-34457"): browser.new_control("file", "upload2", "/path/to/file") def test_with(): """Test that __enter__/__exit__ properly create/close the browser.""" with mechanicalsoup.StatefulBrowser() as browser: assert browser.session is not None assert browser.session is None def test_select_form_nr(): """Test the nr option of select_form.""" forms = """
""" with mechanicalsoup.StatefulBrowser() as browser: browser.open_fake_page(forms) form = browser.select_form() assert form.form['id'] == "a" form = browser.select_form(nr=1) assert form.form['id'] == "b" form = browser.select_form(nr=2) assert form.form['id'] == "c" with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.select_form(nr=3) def test_select_form_tag_object(): """Test tag object as selector parameter type""" forms = """
    <form id="a"></form>
    <form id="b"></form>
    <p>This is not a form.</p>
""" soup = BeautifulSoup(forms, "lxml") with mechanicalsoup.StatefulBrowser() as browser: browser.open_fake_page(forms) form = browser.select_form(soup.find("form", {"id": "b"})) assert form.form['id'] == "b" with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.select_form(soup.find("p")) def test_select_form_associated_elements(): """Test associated elements outside the form tag""" forms = """
""" with mechanicalsoup.StatefulBrowser() as browser: browser.open_fake_page(forms) elements_form_a = set([ "", "", '', '']) elements_form_ab = set(["", '']) form_by_str = browser.select_form("#a") form_by_tag = browser.select_form(browser.page.find("form", id='a')) form_by_css = browser.select_form("form[action$='.php']") assert set([str(element) for element in form_by_str.form.find_all(( "input", "textarea"))]) == elements_form_a assert set([str(element) for element in form_by_tag.form.find_all(( "input", "textarea"))]) == elements_form_a assert set([str(element) for element in form_by_css.form.find_all(( "input", "textarea"))]) == elements_form_ab def test_referer_follow_link(httpbin): browser = mechanicalsoup.StatefulBrowser() open_legacy_httpbin(browser, httpbin) start_url = browser.url response = browser.follow_link("/headers") referer = response.json()["headers"]["Referer"] actual_ref = re.sub('/*$', '', referer) expected_ref = re.sub('/*$', '', start_url) assert actual_ref == expected_ref submit_form_headers = '''
''' def test_referer_submit(httpbin): browser = mechanicalsoup.StatefulBrowser() ref = "https://example.com/my-referer" page = submit_form_headers.format(httpbin.url + "/headers") browser.open_fake_page(page, url=ref) browser.select_form() response = browser.submit_selected() headers = response.json()["headers"] referer = headers["Referer"] actual_ref = re.sub('/*$', '', referer) assert actual_ref == ref @pytest.mark.parametrize("referer_header", ["Referer", "referer"]) def test_referer_submit_override(httpbin, referer_header): """Ensure the caller can override the Referer header that mechanicalsoup would normally add. Because headers are case insensitive, test with both 'Referer' and 'referer'. """ browser = mechanicalsoup.StatefulBrowser() ref = "https://example.com/my-referer" ref_override = "https://example.com/override" page = submit_form_headers.format(httpbin.url + "/headers") browser.open_fake_page(page, url=ref) browser.select_form() response = browser.submit_selected(headers={referer_header: ref_override}) headers = response.json()["headers"] referer = headers["Referer"] actual_ref = re.sub('/*$', '', referer) assert actual_ref == ref_override def test_referer_submit_headers(httpbin): browser = mechanicalsoup.StatefulBrowser() ref = "https://example.com/my-referer" page = submit_form_headers.format(httpbin.url + "/headers") browser.open_fake_page(page, url=ref) browser.select_form() response = browser.submit_selected( headers={'X-Test-Header': 'x-test-value'}) headers = response.json()["headers"] referer = headers["Referer"] actual_ref = re.sub('/*$', '', referer) assert actual_ref == ref assert headers['X-Test-Header'] == 'x-test-value' @pytest.mark.parametrize('expected, kwargs', [ pytest.param('/foo', {}, id='none'), pytest.param('/get', {'string': 'Link'}, id='string'), pytest.param('/get', {'url_regex': 'get'}, id='regex'), ]) def test_follow_link_arg(httpbin, expected, kwargs): browser = mechanicalsoup.StatefulBrowser() html = 'BarLink' browser.open_fake_page(html, httpbin.url) browser.follow_link(bs4_kwargs=kwargs) assert browser.url == httpbin + expected def test_follow_link_excess(httpbin): """Ensure that excess args are passed to BeautifulSoup""" browser = mechanicalsoup.StatefulBrowser() html = 'BarLink' browser.open_fake_page(html, httpbin.url) browser.follow_link(url_regex='get') assert browser.url == httpbin + '/get' browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page('Link', httpbin.url) with pytest.raises(ValueError, match="link parameter cannot be .*"): browser.follow_link('foo', url_regex='bar') def test_follow_link_ua(httpbin): """Tests passing requests parameters to follow_link() by setting the User-Agent field.""" browser = mechanicalsoup.StatefulBrowser() # html = 'BarLink' # browser.open_fake_page(html, httpbin.url) open_legacy_httpbin(browser, httpbin) bs4_kwargs = {'url_regex': 'user-agent'} requests_kwargs = {'headers': {"User-Agent": '007'}} resp = browser.follow_link(bs4_kwargs=bs4_kwargs, requests_kwargs=requests_kwargs) assert browser.url == httpbin + '/user-agent' assert resp.json() == {'user-agent': '007'} assert resp.request.headers['user-agent'] == '007' def test_link_arg_multiregex(httpbin): browser = mechanicalsoup.StatefulBrowser() browser.open_fake_page('Link', httpbin.url) with pytest.raises(ValueError, match="link parameter cannot be .*"): browser.follow_link('foo', bs4_kwargs={'url_regex': 'bar'}) def file_get_contents(filename): with open(filename, "rb") as fd: return fd.read() def test_download_link(httpbin): """Test 
downloading the contents of a link to file.""" browser = mechanicalsoup.StatefulBrowser() open_legacy_httpbin(browser, httpbin) tmpdir = tempfile.mkdtemp() tmpfile = tmpdir + '/nosuchfile.png' current_url = browser.url current_page = browser.page response = browser.download_link(file=tmpfile, link='image/png') # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that the file was downloaded assert os.path.isfile(tmpfile) assert file_get_contents(tmpfile) == response.content # Check that we actually downloaded a PNG file assert response.content[:4] == b'\x89PNG' def test_download_link_nofile(httpbin): """Test downloading the contents of a link without saving it.""" browser = mechanicalsoup.StatefulBrowser() open_legacy_httpbin(browser, httpbin) current_url = browser.url current_page = browser.page response = browser.download_link(link='image/png') # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that we actually downloaded a PNG file assert response.content[:4] == b'\x89PNG' def test_download_link_nofile_bs4(httpbin): """Test downloading the contents of a link without saving it.""" browser = mechanicalsoup.StatefulBrowser() open_legacy_httpbin(browser, httpbin) current_url = browser.url current_page = browser.page response = browser.download_link(bs4_kwargs={'url_regex': 'image.png'}) # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that we actually downloaded a PNG file assert response.content[:4] == b'\x89PNG' def test_download_link_nofile_excess(httpbin): """Test downloading the contents of a link without saving it.""" browser = mechanicalsoup.StatefulBrowser() open_legacy_httpbin(browser, httpbin) current_url = browser.url current_page = browser.page response = browser.download_link(url_regex='image.png') # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that we actually downloaded a PNG file assert response.content[:4] == b'\x89PNG' def test_download_link_nofile_ua(httpbin): """Test downloading the contents of a link without saving it.""" browser = mechanicalsoup.StatefulBrowser() open_legacy_httpbin(browser, httpbin) current_url = browser.url current_page = browser.page requests_kwargs = {'headers': {"User-Agent": '007'}} response = browser.download_link(link='image/png', requests_kwargs=requests_kwargs) # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that we actually downloaded a PNG file assert response.content[:4] == b'\x89PNG' # Check that we actually set the User-agent outbound assert response.request.headers['user-agent'] == '007' def test_download_link_to_existing_file(httpbin): """Test downloading the contents of a link to an existing file.""" browser = mechanicalsoup.StatefulBrowser() open_legacy_httpbin(browser, httpbin) tmpdir = tempfile.mkdtemp() tmpfile = tmpdir + '/existing.png' with open(tmpfile, "w") as fd: fd.write("initial content") current_url = browser.url current_page = browser.page response = browser.download_link('image/png', tmpfile) # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that the file was downloaded assert os.path.isfile(tmpfile) assert file_get_contents(tmpfile) == response.content # Check that we actually 
downloaded a PNG file assert response.content[:4] == b'\x89PNG' def test_download_link_404(httpbin): """Test downloading the contents of a broken link.""" browser = mechanicalsoup.StatefulBrowser(raise_on_404=True) browser.open_fake_page('Link', url=httpbin.url) tmpdir = tempfile.mkdtemp() tmpfile = tmpdir + '/nosuchfile.txt' current_url = browser.url current_page = browser.page with pytest.raises(mechanicalsoup.LinkNotFoundError): browser.download_link(file=tmpfile, link_text='Link') # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that the file was not downloaded assert not os.path.exists(tmpfile) def test_download_link_referer(httpbin): """Test downloading the contents of a link to file.""" browser = mechanicalsoup.StatefulBrowser() ref = httpbin + "/my-referer" browser.open_fake_page('Link', url=ref) tmpfile = tempfile.NamedTemporaryFile() current_url = browser.url current_page = browser.page browser.download_link(file=tmpfile.name, link_text='Link') # Check that the browser state has not changed assert browser.url == current_url assert browser.page == current_page # Check that the file was downloaded with open(tmpfile.name) as fd: json_data = json.load(fd) headers = json_data["headers"] assert headers["Referer"] == ref def test_refresh_open(): url = 'mock://example.com' initial_page = BeautifulSoup('

<p>Fake empty page</p>

', 'lxml') reload_page = BeautifulSoup('

<p>Fake reloaded page</p>

', 'lxml') browser, adapter = prepare_mock_browser() mock_get(adapter, url=url, reply=str(initial_page)) browser.open(url) mock_get(adapter, url=url, reply=str(reload_page), additional_matcher=lambda r: 'Referer' not in r.headers) browser.refresh() assert browser.url == url assert browser.page == reload_page def test_refresh_follow_link(): url = 'mock://example.com' follow_url = 'mock://example.com/followed' initial_content = f'Link' initial_page = BeautifulSoup(initial_content, 'lxml') reload_page = BeautifulSoup('

<p>Fake reloaded page</p>

', 'lxml') browser, adapter = prepare_mock_browser() mock_get(adapter, url=url, reply=str(initial_page)) mock_get(adapter, url=follow_url, reply=str(initial_page)) browser.open(url) browser.follow_link() refer_header = {'Referer': url} mock_get(adapter, url=follow_url, reply=str(reload_page), request_headers=refer_header) browser.refresh() assert browser.url == follow_url assert browser.page == reload_page def test_refresh_form_not_retained(): url = 'mock://example.com' initial_content = '
<form method="post">Here comes the form</form>
' initial_page = BeautifulSoup(initial_content, 'lxml') reload_page = BeautifulSoup('

<p>Fake reloaded page</p>

', 'lxml') browser, adapter = prepare_mock_browser() mock_get(adapter, url=url, reply=str(initial_page)) browser.open(url) browser.select_form() mock_get(adapter, url=url, reply=str(reload_page), additional_matcher=lambda r: 'Referer' not in r.headers) browser.refresh() assert browser.url == url assert browser.page == reload_page with pytest.raises(AttributeError, match="No form has been selected yet."): browser.form def test_refresh_error(): browser = mechanicalsoup.StatefulBrowser() # Test no page with pytest.raises(ValueError): browser.refresh() # Test fake page with pytest.raises(ValueError): browser.open_fake_page('

<p>Fake empty page</p>

', url='http://fake.com') browser.refresh() def test_requests_session_and_cookies(httpbin): """Check that the session object passed to the constructor of StatefulBrowser is actually taken into account.""" s = requests.Session() requests.utils.add_dict_to_cookiejar(s.cookies, {'key1': 'val1'}) browser = mechanicalsoup.StatefulBrowser(session=s) resp = browser.get(httpbin + "/cookies") assert resp.json() == {'cookies': {'key1': 'val1'}} if __name__ == '__main__': pytest.main(sys.argv) ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/tests/test_utils.py0000644000175100001720000000041414451070517017724 0ustar00runnerdockerimport pytest import mechanicalsoup def test_LinkNotFoundError(): with pytest.raises(mechanicalsoup.LinkNotFoundError): raise mechanicalsoup.utils.LinkNotFoundError with pytest.raises(Exception): raise mechanicalsoup.utils.LinkNotFoundError ././@PaxHeader0000000000000000000000000000002600000000000010213 xustar0022 mtime=1688498511.0 MechanicalSoup-1.3.0/tests/utils.py0000644000175100001720000000571214451070517016673 0ustar00runnerdockerfrom urllib.parse import parse_qsl import requests_mock import mechanicalsoup """ Utilities for testing MechanicalSoup. """ choose_submit_form = '''
''' def setup_mock_browser(expected_post=None, text=choose_submit_form): url = 'mock://form.com' browser, mock = prepare_mock_browser() mock_get(mock, url, text) if expected_post is not None: mock_post(mock, url + '/post', expected_post) return browser, url def prepare_mock_browser(scheme='mock'): mock = requests_mock.Adapter() browser = mechanicalsoup.StatefulBrowser(requests_adapters={scheme: mock}) return browser, mock def mock_get(mocked_adapter, url, reply, content_type='text/html', **kwargs): headers = {'Content-Type': content_type} if isinstance(reply, str): kwargs['text'] = reply else: kwargs['content'] = reply mocked_adapter.register_uri('GET', url, headers=headers, **kwargs) def mock_post(mocked_adapter, url, expected, reply='Success!'): def text_callback(request, context): query = parse_qsl(request.text) assert query == expected return reply mocked_adapter.register_uri('POST', url, text=text_callback) class HttpbinRemote: """Drop-in replacement for pytest-httpbin's httpbin fixture that uses the remote httpbin server instead of a local one.""" def __init__(self): self.url = "http://httpbin.org" def __add__(self, x): return self.url + x def open_legacy_httpbin(browser, httpbin): """Opens the start page of httpbin (given as a fixture). Tries the legacy page (available only on recent versions of httpbin), and if it fails fall back to the main page (which is JavaScript-only in recent versions of httpbin hence usable for us only on old versions). """ try: response = browser.open(httpbin + "/legacy") if response.status_code == 404: # The line above may or may not have raised the exception # depending on raise_on_404. Raise it unconditionally now. raise mechanicalsoup.LinkNotFoundError() return response except mechanicalsoup.LinkNotFoundError: return browser.open(httpbin.url)
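

# ---------------------------------------------------------------------------
# Illustrative usage of the helpers above. This is a hedged sketch added for
# documentation purposes and is not part of the released test suite: the form
# markup, the "q"/"go" field names and the _example_* function names are
# invented for the example; everything else relies only on helpers defined in
# this module and on public MechanicalSoup / requests_mock behaviour.

def _example_mocked_form_submission():
    """Drive a form submission entirely against the mock adapter."""
    page = '''
    <form id="example-form" method="post" action="mock://form.com/post">
      <input type="text" name="q" value=""/>
      <input type="submit" name="go" value="Go"/>
    </form>
    '''
    # setup_mock_browser() registers GET mock://form.com to return `page` and
    # POST mock://form.com/post to assert the posted fields (in document
    # order) before replying 'Success!'.
    browser, url = setup_mock_browser(
        expected_post=[('q', 'MechanicalSoup'), ('go', 'Go')], text=page)
    browser.open(url)
    browser.select_form('#example-form')
    browser['q'] = 'MechanicalSoup'
    response = browser.submit_selected()
    assert response.text == 'Success!'
    browser.close()


def _example_remote_httpbin():
    """Open httpbin via the fallback helper; requires network access to
    httpbin.org, which is why the real suite prefers the local fixture."""
    browser = mechanicalsoup.StatefulBrowser()
    open_legacy_httpbin(browser, HttpbinRemote())
    assert browser.page is not None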